Systems and methods for determining relationships among data elements

ABSTRACT

A data processing system configured to perform: obtaining a first data lineage representing relationships among physical data elements, the first data lineage being generated at least in part by performing at least one of: (a) analyzing source code of at least one computer program configured to access the physical data elements; and (b) analyzing information obtained during runtime of the at least one computer program; obtaining, based on user input, a second data lineage representing relationships among business data elements; obtaining an association between at least some of the physical data elements of the first data lineage and at least some of the business data elements of the second data lineage; and generating, based on the association between the physical data elements and the business data elements, an indication of agreement or discrepancy between the first data lineage and the second data lineage.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit under 35 U.S.C. § 120 and is a continuation of U.S. patent application Ser. No. 15/807,897 entitled “SYSTEMS AND METHODS FOR DETERMINING RELATIONSHIPS AMONG DATA ELEMENTS”, filed Nov. 9, 2017, which claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Application Ser. No. 62/419,826, titled “SYSTEMS AND METHODS FOR DETERMINING RELATIONSHIPS AMONG DATA ELEMENTS”, filed on Nov. 9, 2016, each application of which is incorporated by reference herein in its entirety.

BACKGROUND

Organizations that manage large amounts of data often wish to obtain data lineage for at least some of the data being managed. Data lineage for a set of data being managed may include information indicating how the set of data was obtained, how the set of data may change over time, and/or how the set of data may be used by one or more data processing systems and/or processes. Data lineage for a set of data may include upstream lineage information indicating how the set of data was obtained. For example, upstream lineage information may identify one or more data sources from which the set of data was obtained and/or one or more data processing operations that have been applied to the set of data. Additionally or alternatively, data lineage for a set of data may include downstream lineage information indicating one or more other datasets, processes, and/or applications that depend on and/or use the set of data. An organization may wish to obtain lineage information for any suitable set of data such as, for example, one or more data records, one or more tables of data in a database, one or more spreadsheets of data, one or more files of data, a single data value, data used to produce one or more reports, data accessed by one or more application programs, and/or any other suitable set of data.
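As a non-limiting illustration, upstream and downstream lineage can be viewed as reachability in a directed dependency graph. The following Python sketch assumes a simple edge-list representation; the dataset names and the helper functions `build_index` and `reachable` are hypothetical and are not drawn from any particular embodiment.

```python
from collections import defaultdict

# Hypothetical dependency edges: (source, target) means the target is derived from the source.
EDGES = [
    ("raw_accounts", "clean_accounts"),
    ("raw_payments", "clean_payments"),
    ("clean_accounts", "credit_score"),
    ("clean_payments", "credit_score"),
    ("credit_score", "risk_report"),
]


def build_index(edges):
    """Index edges by target (for upstream walks) and by source (for downstream walks)."""
    parents, children = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)
        children[src].add(dst)
    return parents, children


def reachable(start, neighbors):
    """Return every node reachable from start by repeatedly following neighbors."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


parents, children = build_index(EDGES)
upstream = reachable("credit_score", parents)      # how the data was obtained
downstream = reachable("credit_score", children)   # what depends on or uses the data
```

In this sketch, the upstream lineage of `credit_score` contains both raw and cleaned datasets, and its downstream lineage contains only `risk_report`.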

There are many uses of lineage information about the data managed by an organization's data processing systems. Examples of such uses include, but are not limited to, risk reduction, verification of regulatory compliance obligations, streamlining of business processes, safeguarding data, tracing errors back to their sources, and determining whether changes to data may lead to downstream errors. In some cases, incomplete or incorrect lineage information can lead to negative practical effects on the organization, such as records being handled incorrectly, inaccurate data being provided to members of the organization, inefficient system operation, system failures, inadvertent introduction of errors, inefficient resolution of errors, difficulty complying with regulatory processes, etc. For a business organization, such effects can quickly lead to customer and/or regulator dissatisfaction. Accordingly, it is important that lineage information is both correct and complete.

SUMMARY

Some embodiments are directed to a data processing system, comprising: at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining a first data lineage representing relationships among a plurality of physical data elements, the first data lineage being generated at least in part by performing at least one of: (a) analyzing source code of at least one computer program configured to access at least some of the plurality of physical data elements; and (b) analyzing information obtained during runtime of the at least one computer program; obtaining, based at least in part on user input, a second data lineage representing relationships among a plurality of business data elements; obtaining an association between at least some of the plurality of physical data elements of the first data lineage and at least some of the plurality of business data elements of the second data lineage; and generating, based on the association between the plurality of physical data elements and the plurality of business data elements, an indication of agreement or discrepancy between the first data lineage and the second data lineage.

In some embodiments, generating the indication of agreement or discrepancy comprises: displaying a visualization of the second data lineage showing the indication of agreement or discrepancy.

In some embodiments including any of the preceding embodiments, the second data lineage comprises a first link representing a first dependency between two business data elements, and wherein displaying the visualization of the second data lineage comprises displaying the link in one manner when there is a dependency in the first data lineage corresponding to the first dependency and in another manner when there is not a dependency in the first data lineage corresponding to the first dependency.
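As a non-limiting illustration, the choice of display manner for a link may be sketched as a lookup through the association. The Python sketch below assumes that the first data lineage is available as a set of (source, target) pairs of physical data elements and that the association is a mapping from business names to physical names; the element names, the `link_style` helper, and the "solid"/"dashed" styles are all hypothetical.

```python
# Hypothetical links of the second (user-specified) data lineage.
business_links = [
    ("Customer Income", "Credit Score"),
    ("Payment History", "Credit Score"),
]

# Hypothetical association: business data element -> physical data element.
association = {
    "Customer Income": "cust.income_amt",
    "Payment History": "pay.hist_cd",
    "Credit Score": "score.fico_val",
}

# Hypothetical dependencies found in the first (derived) data lineage.
physical_dependencies = {("cust.income_amt", "score.fico_val")}


def link_style(link, association, physical_dependencies):
    """Return one style when a corresponding physical dependency exists, another otherwise."""
    src, dst = link
    physical_link = (association.get(src), association.get(dst))
    return "solid" if physical_link in physical_dependencies else "dashed"


styles = {
    link: link_style(link, association, physical_dependencies)
    for link in business_links
}
```

Here the link from "Customer Income" is drawn solid because a corresponding physical dependency exists, while the link from "Payment History" is drawn dashed. A fuller implementation might also treat indirect physical dependencies as corresponding; this sketch checks direct pairs only.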

In some embodiments including any of the preceding embodiments, generating the indication of agreement or discrepancy comprises: determining, based on the association between the plurality of physical data elements and the plurality of business data elements, whether there is one or more discrepancies among the first data lineage, the second data lineage, and the obtained association.

In some embodiments including any of the preceding embodiments, obtaining the first data lineage comprises generating the first data lineage at least in part by performing at least one of analyzing the source code of the at least one computer program and analyzing the information obtained during runtime of the at least one computer program.

In some embodiments including any of the preceding embodiments, obtaining the first data lineage comprises analyzing the source code of the at least one computer program.

In some embodiments, obtaining the first data lineage comprises analyzing the information obtained during runtime of the at least one computer program.

In some embodiments including any of the preceding embodiments, the at least one computer program comprises a computer program implemented as a dataflow graph.

In some embodiments including any of the preceding embodiments, obtaining the association between the at least some of the plurality of physical data elements of the first data lineage and the at least some of the plurality of business data elements of the second data lineage comprises generating the association based on user input provided via a graphical user interface.

In some embodiments including any of the preceding embodiments, the plurality of physical data elements comprises a first physical data element, the plurality of business data elements comprises a first business data element, the association indicates that the first physical data element and the first business data element are associated, and the determining comprises determining that a first set of one or more sources of data identified in the first data lineage as being used to obtain the first physical data element is different from a second set of one or more sources of data identified in the second data lineage as being used to obtain the first business data element.
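As a non-limiting illustration, the comparison of source sets may be sketched with set differences. The Python sketch below assumes that sources are identified by the same names in both lineages (in practice a further mapping may be needed); the `SourceComparison` type, the `compare_sources` helper, and all element and source names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceComparison:
    only_in_derived: frozenset  # sources found only in the first (derived) lineage
    only_in_stated: frozenset   # sources found only in the second (user-specified) lineage

    @property
    def agrees(self):
        """True when both lineages identify exactly the same sources."""
        return not self.only_in_derived and not self.only_in_stated


def compare_sources(derived_sources, stated_sources, association, business_element):
    """Compare the sources of a business data element with those of its associated physical element."""
    physical_element = association[business_element]
    derived = frozenset(derived_sources.get(physical_element, ()))
    stated = frozenset(stated_sources.get(business_element, ()))
    return SourceComparison(derived - stated, stated - derived)


# Hypothetical example data.
derived_sources = {"score.fico_val": {"Bureau Feed", "Payments DB"}}
stated_sources = {"Credit Score": {"Bureau Feed", "Accounts DB"}}
association = {"Credit Score": "score.fico_val"}

result = compare_sources(derived_sources, stated_sources, association, "Credit Score")
```

In this example the two sets differ, so `result.agrees` is false; the two difference sets indicate which sources appear in only one of the lineages.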

In some embodiments including any of the preceding embodiments, the acts of obtaining the first data lineage and determining whether there is a discrepancy are performed repeatedly according to a specified schedule.

In some embodiments including any of the preceding embodiments, the association comprises an association between a first physical data element of the plurality of physical data elements and a first business data element of the plurality of business data elements, and the at least one computer hardware processor is further configured to perform: determining, based at least in part on the association between the first physical data element and the first business data element, a measure of data quality for the first business data element.

In some embodiments including any of the preceding embodiments, determining the measure of data quality for the first business data element comprises: performing an analysis of data quality of data in the first physical data element based at least in part on one or more data quality rules associated with the data in the first physical data element.

In some embodiments including any of the preceding embodiments, the measure of data quality for the first business data element includes a measure of one or more of accuracy, completeness, and validity.
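As a non-limiting illustration, a measure of data quality for a business data element may be sketched by evaluating rules against the data of its associated physical data element. The Python sketch below computes completeness and validity only (accuracy would typically require reference data); the field name, the 300 to 850 range rule, and the helper functions are assumptions made for the example.

```python
def completeness(values):
    """Fraction of values that are present (not None)."""
    return sum(v is not None for v in values) / len(values)


def validity(values, is_valid):
    """Fraction of present values that satisfy the data quality rule."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return sum(is_valid(v) for v in present) / len(present)


# Hypothetical data held in a physical data element, and a hypothetical rule for it.
physical_data = {"score.fico_val": [720, 655, None, 910]}
rules = {"score.fico_val": lambda v: 300 <= v <= 850}

# Hypothetical association: business data element -> physical data element.
association = {"Credit Score": "score.fico_val"}


def business_quality(business_element):
    """Attribute the physical element's quality measures to the associated business element."""
    physical_element = association[business_element]
    values = physical_data[physical_element]
    return {
        "completeness": completeness(values),
        "validity": validity(values, rules[physical_element]),
    }


quality = business_quality("Credit Score")
```

Here three of the four values are present and two of those three fall within the assumed range, so the business data element "Credit Score" inherits a completeness of 0.75 and a validity of two thirds.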

Some embodiments are directed to a method, comprising: using at least one computer hardware processor to perform: obtaining a first data lineage representing relationships among a plurality of physical data elements, the first data lineage being generated at least in part by performing at least one of: (a) analyzing source code of at least one computer program configured to access at least some of the plurality of physical data elements; and (b) analyzing information obtained during runtime of the at least one computer program; obtaining, based at least in part on user input, a second data lineage representing relationships among a plurality of business data elements; obtaining an association between at least some of the plurality of physical data elements of the first data lineage and at least some of the plurality of business data elements of the second data lineage; and generating, based on the association between the plurality of physical data elements and the plurality of business data elements, an indication of agreement or discrepancy between the first data lineage and the second data lineage.

Some embodiments are directed to at least one non-transitory computer-readable storage medium storing processor executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining a first data lineage representing relationships among a plurality of physical data elements, the first data lineage being generated at least in part by performing at least one of: (a) analyzing source code of at least one computer program configured to access at least some of the plurality of physical data elements; and (b) analyzing information obtained during runtime of the at least one computer program; obtaining, based at least in part on user input, a second data lineage representing relationships among a plurality of business data elements; obtaining an association between at least some of the plurality of physical data elements of the first data lineage and at least some of the plurality of business data elements of the second data lineage; and generating, based on the association between the plurality of physical data elements and the plurality of business data elements, an indication of agreement or discrepancy between the first data lineage and the second data lineage.

Some embodiments are directed to at least one non-transitory computer-readable storage medium storing processor executable instructions for execution by at least one computer hardware processor, the processor executable instructions comprising: means for obtaining a first data lineage representing relationships among a plurality of physical data elements, the first data lineage being generated at least in part by performing at least one of: (a) analyzing source code of at least one computer program configured to access at least some of the plurality of physical data elements; and (b) analyzing information obtained during runtime of the at least one computer program; means for obtaining, based at least in part on user input, a second data lineage representing relationships among a plurality of business data elements; means for obtaining an association between at least some of the plurality of physical data elements of the first data lineage and at least some of the plurality of business data elements of the second data lineage; and means for generating, based on the association between the plurality of physical data elements and the plurality of business data elements, an indication of agreement or discrepancy between the first data lineage and the second data lineage.

Some embodiments are directed to a data processing system for determining whether there is a discrepancy among a first data lineage, a second data lineage, and an association between data elements of the first and second data lineages. The system comprises at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining a first data lineage representing relationships among a plurality of physical data elements, the first data lineage being generated at least in part by performing at least one of: (a) analyzing source code of at least one computer program configured to access at least some of the plurality of physical data elements; and (b) analyzing information obtained during runtime of the at least one computer program; obtaining, based at least in part on user input, a second data lineage representing relationships among a plurality of business data elements; obtaining an association between at least some of the plurality of physical data elements of the first data lineage and at least some of the plurality of business data elements of the second data lineage; and determining, based on the association between the plurality of physical data elements and the plurality of business data elements, whether there is one or more discrepancies among the first data lineage, the second data lineage, and the obtained association.

Some embodiments are directed to a method, comprising using at least one computer hardware processor to perform: obtaining a first data lineage representing relationships among a plurality of physical data elements, the first data lineage being generated at least in part by performing at least one of: (a) analyzing source code of at least one computer program configured to access at least some of the plurality of physical data elements; and (b) analyzing information obtained during runtime of the at least one computer program; obtaining, based at least in part on user input, a second data lineage representing relationships among a plurality of business data elements; obtaining an association between at least some of the plurality of physical data elements of the first data lineage and at least some of the plurality of business data elements of the second data lineage; and determining, based on the association between the plurality of physical data elements and the plurality of business data elements, whether there is one or more discrepancies among the first data lineage, the second data lineage, and the obtained association.

Some embodiments are directed to at least one non-transitory computer-readable storage medium storing processor executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining a first data lineage representing relationships among a plurality of physical data elements, the first data lineage being generated at least in part by performing at least one of: (a) analyzing source code of at least one computer program configured to access at least some of the plurality of physical data elements; and (b) analyzing information obtained during runtime of the at least one computer program; obtaining, based at least in part on user input, a second data lineage representing relationships among a plurality of business data elements; obtaining an association between at least some of the plurality of physical data elements of the first data lineage and at least some of the plurality of business data elements of the second data lineage; and determining, based on the association between the plurality of physical data elements and the plurality of business data elements, whether there is one or more discrepancies among the first data lineage, the second data lineage, and the obtained association.

Some embodiments are directed to at least one non-transitory computer-readable storage medium storing processor executable instructions for execution by at least one computer hardware processor, the processor executable instructions comprising: means for obtaining a first data lineage representing relationships among a plurality of physical data elements, the first data lineage being generated at least in part by performing at least one of: (a) analyzing source code of at least one computer program configured to access at least some of the plurality of physical data elements; and (b) analyzing information obtained during runtime of the at least one computer program; means for obtaining, based at least in part on user input, a second data lineage representing relationships among a plurality of business data elements; means for obtaining an association between at least some of the plurality of physical data elements of the first data lineage and at least some of the plurality of business data elements of the second data lineage; and means for determining, based on the association between the plurality of physical data elements and the plurality of business data elements, whether there is one or more discrepancies among the first data lineage, the second data lineage, and the obtained association.

Some embodiments are directed to a data processing system for determining a measure of data quality for one or more business data elements. The system comprises at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing processor executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining a first data lineage representing relationships among a plurality of physical data elements, the first data lineage being generated at least in part by performing at least one of analyzing source code of at least one computer program configured to access at least some of the plurality of physical data elements and analyzing information obtained during runtime of the at least one computer program; obtaining, based at least in part on user input, a second data lineage representing relationships among a plurality of business data elements; obtaining an association between at least some of the plurality of physical data elements of the first data lineage and at least some of the plurality of business data elements of the second data lineage, the association including an association between a first physical data element of the plurality of physical data elements and a first business data element of the plurality of business data elements; and determining a measure of data quality for the first business data element based at least in part on at least one data quality measure associated with the first physical data element and the association between the first physical data element and the first business data element.

In some embodiments, determining the measure of data quality for the first business data element comprises performing an analysis of data quality of data in the first physical data element based at least in part on one or more data quality rules associated with the data in the first physical data element to obtain the at least one data quality measure associated with the first physical data element.

In some embodiments, the measure of data quality for the first business data element includes a measure of one or more of accuracy, completeness, and validity.

Some embodiments are directed to a method comprising using at least one computer hardware processor to perform: obtaining a first data lineage representing relationships among a plurality of physical data elements, the first data lineage being generated at least in part by performing at least one of analyzing source code of at least one computer program configured to access at least some of the plurality of physical data elements and analyzing information obtained during runtime of the at least one computer program; obtaining, based at least in part on user input, a second data lineage representing relationships among a plurality of business data elements; obtaining an association between at least some of the plurality of physical data elements of the first data lineage and at least some of the plurality of business data elements of the second data lineage, the association including an association between a first physical data element of the plurality of physical data elements and a first business data element of the plurality of business data elements; and determining a measure of data quality for the first business data element based at least in part on at least one data quality measure associated with the first physical data element and the association between the first physical data element and the first business data element.

Some embodiments are directed to at least one non-transitory computer-readable storage medium storing processor executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining a first data lineage representing relationships among a plurality of physical data elements, the first data lineage being generated at least in part by performing at least one of analyzing source code of at least one computer program configured to access at least some of the plurality of physical data elements and analyzing information obtained during runtime of the at least one computer program; obtaining, based at least in part on user input, a second data lineage representing relationships among a plurality of business data elements; obtaining an association between at least some of the plurality of physical data elements of the first data lineage and at least some of the plurality of business data elements of the second data lineage, the association including an association between a first physical data element of the plurality of physical data elements and a first business data element of the plurality of business data elements; and determining a measure of data quality for the first business data element based at least in part on at least one data quality measure associated with the first physical data element and the association between the first physical data element and the first business data element.

Some embodiments are directed to at least one non-transitory computer-readable storage medium storing processor executable instructions for execution by at least one computer hardware processor, the processor executable instructions comprising: means for obtaining a first data lineage representing relationships among a plurality of physical data elements, the first data lineage being generated at least in part by performing at least one of analyzing source code of at least one computer program configured to access at least some of the plurality of physical data elements and analyzing information obtained during runtime of the at least one computer program; means for obtaining, based at least in part on user input, a second data lineage representing relationships among a plurality of business data elements; means for obtaining an association between at least some of the plurality of physical data elements of the first data lineage and at least some of the plurality of business data elements of the second data lineage, the association including an association between a first physical data element of the plurality of physical data elements and a first business data element of the plurality of business data elements; and means for determining a measure of data quality for the first business data element based at least in part on at least one data quality measure associated with the first physical data element and the association between the first physical data element and the first business data element.

The foregoing is a non-limiting summary of the invention, which is defined by the attached claims.

BRIEF DESCRIPTION OF DRAWINGS

Various aspects and embodiments will be described with reference to the following figures. It should be appreciated that the figures are not necessarily drawn to scale. Items appearing in multiple figures are indicated by the same or a similar reference number in all the figures in which they appear.

FIG. 1 is a block diagram of an illustrative computing environment, in which some embodiments of the technology described herein may operate.

FIG. 2 is an illustrative graphical representation of an illustrative derived data lineage, in accordance with some embodiments of the technology described herein.

FIG. 3A is a diagram illustrating an association between a user-specified lineage and a derived data lineage, in accordance with some embodiments of the technology described herein.

FIG. 3B is another diagram illustrating an association between a user-specified lineage and a derived data lineage, in accordance with some embodiments of the technology described herein.

FIG. 3C is another diagram illustrating an association between a user-specified lineage and a derived data lineage, in accordance with some embodiments of the technology described herein.

FIG. 3D is another diagram illustrating an association between a user-specified lineage and a derived data lineage, in accordance with some embodiments of the technology described herein.

FIG. 4A is a diagram illustrating a graphical interface through which a business data element may be associated with a physical data element, in accordance with some embodiments of the technology described herein.

FIG. 4B is a diagram illustrating another graphical interface through which a physical data element may be associated with a business data element, in accordance with some embodiments of the technology described herein.

FIG. 5 is a flowchart of an illustrative process for obtaining an association between a user-specified data lineage and a derived data lineage and using the obtained association to determine whether there are any discrepancies among the user-specified data lineage, the derived data lineage, and the association between them, in accordance with some embodiments of the technology described herein.

FIGS. 6A-B are diagrams of illustrative graphical interfaces showing information about a business data element “credit score,” in accordance with some embodiments of the technology described herein.

FIG. 6C is a diagram of an illustrative user interface presenting a derived data lineage for the business data element “credit score,” in accordance with some embodiments of the technology described herein.

FIG. 6D is a diagram of an illustrative user interface presenting a user-specified data lineage for the business data element “credit score,” in accordance with some embodiments of the technology described herein.

FIG. 6E is a diagram of an illustrative user interface indicating presence of a discrepancy between the user-specified and derived lineages for the business data element “credit score,” in accordance with some embodiments of the technology described herein.

FIG. 7 is a block diagram of an illustrative computing system environment that may be used in implementing some embodiments of the technology described herein.

FIG. 8A is a diagram of an illustrative user interface presenting a user-specified data lineage, in accordance with some embodiments of the technology described herein.

FIG. 8B is a diagram of an illustrative user interface providing details about dependency between two business data elements in the user-specified data lineage of FIG. 8A, in accordance with some embodiments of the technology described herein.

FIG. 8C is a diagram of an illustrative user interface presenting a derived data lineage corresponding to a portion of the user-specified data lineage of FIG. 8A, in accordance with some embodiments of the technology described herein.

FIG. 8D is a diagram of an illustrative user interface presenting information about a node in the user-specified data lineage of FIG. 8A, in accordance with some embodiments of the technology described herein.

FIG. 8E is a diagram of an illustrative user interface presenting information about a physical data element associated with a business data element in the user-specified data lineage of FIG. 8A.

FIG. 8F is a diagram of an illustrative user interface providing details about dependency between two other business data elements in the user-specified data lineage of FIG. 8A, in accordance with some embodiments of the technology described herein.

DETAILED DESCRIPTION

The inventors have recognized and appreciated that accuracy, auditability, efficiency, and reliability of a data processing system may be improved by techniques that facilitate generating accurate and complete lineage information for data managed by the data processing system. Such techniques may be used to identify the presence of problems in data processing systems and facilitate their resolution, thereby improving functionality of data processing systems and reducing data processing errors. The inventors have further recognized and appreciated techniques for improving conventional approaches to generating data lineage information.

Some conventional techniques for generating data lineage information are manual. Although using manual techniques for generating data lineage information allows for customizing the generated data lineage information to include terminology understood by and information of interest to the people requesting the data lineage information, there are numerous disadvantages. First, the accuracy of data lineage information generated using conventional manual techniques cannot be automatically verified. For example, when a person manually creating a data lineage for a report indicates that some data used for generating the report originated from a particular data source (e.g., a database system at a particular location), that indication cannot be verified in any way other than by manually re-checking the person's work. Second, manually generated data lineage information quickly becomes stale as data managed by a data processing system frequently changes, for example, because of the removal and/or addition of data sources, migration of data, changes to data processing logic, and the like. Such changes occur faster than conventional manual lineage generation techniques can keep up with.

Automated techniques for generating data lineage information may address some of these shortcomings. For example, automated data lineage generation techniques may be executed repeatedly such that the data lineage information generated is up-to-date. As another example, the generated data lineage information may be verified by one or more computer programs. However, automated techniques for generating data lineage information also have some disadvantages. For example, data lineage information produced by an automated technique (e.g., a technique based on analyzing the source code of one or more applications operating on data managed by a data processing system) may include terminology (e.g., technical names of variables and data record fields) that is not easily understood by the people (e.g., business people) viewing the data lineage information. As another example, the automatically generated data lineage information may include much more information than the people viewing it wish to see. For instance, automatically generated data lineage information may include detailed information about each and every transformation applied to the data including some that are likely inconsequential to the people viewing the lineage (e.g., sorting data records according to a key to extract information about all customers whose last names begin with “A” may be a transformation that is not of interest to a bank executive interested in the lineage of a data value indicating the credit score of a bank customer whose last name is “Armstrong”).

The inventors have recognized and appreciated that both manually and automatically obtained information provides useful information that can be used to refine the overall data lineage. Accordingly, some embodiments provide for improved techniques for generating data lineage information. Rather than using only manually-generated data lineage information or only automatically-generated data lineage information, each of which has drawbacks including those described above, the techniques developed by the inventors and described herein provide for generating accurate and complete data lineage information by: (1) obtaining manually generated data lineage information (termed “user-specified data lineage” or “user-specified lineage” or “stated lineage” herein); (2) obtaining automatically generated data lineage information (termed “derived data lineage” or “derived lineage” herein); and (3) obtaining an association between the user-specified and derived data lineages (e.g., by generating an association or accessing a previously generated association). The obtained association may be used to address at least some of the above-described drawbacks of using either type of data lineage information alone. As one example, the association between a user-specified data lineage and a derived data lineage may be used to verify the accuracy of the user-specified data lineage and, more generally, to identify discrepancies or inconsistencies between these two types of lineages. As another example, the association between a user-specified lineage and a derived data lineage may map information in the derived data lineage, often expressed using technical terminology, to business terminology more readily accessible by consumers of data lineage information. As yet another example, the association between a user-specified data lineage and a derived data lineage may be used to verify the accuracy of the derived data lineage.
Identifying errors in the derived data lineage (e.g., via an inconsistency with the user-specified lineage) allows for the identification of problems with underlying data processing systems, the communication links among them, and/or data processing errors. In turn, identifying and addressing such problems improves the functionality of the underlying data processing systems and reduces data processing errors. Because a derived data lineage provides extremely detailed information about the flow of data, finding errors from such detailed information is very difficult; it is akin to finding a needle in a haystack. Associating a user-specified lineage to the derived data lineage, in accordance with the embodiments described herein, facilitates identifying any data processing errors in a way that the derived data lineage alone does not.

The techniques developed by the inventors and described herein improve data processing systems. First, the techniques described herein provide an improvement over conventional data lineage techniques, which are included in many data processing systems. Second, the techniques described herein allow for generating indications of agreement and/or discrepancy between user-specified and derived data lineages, which allows for the identification of errors in either type of lineage and, as a result, facilitates identifying and resolving data processing errors in data processing systems.

Some embodiments described herein address all of the above-described issues that the inventors have recognized with conventional techniques for generating data lineage information. However, not every embodiment described below addresses every one of these issues, and some embodiments may not address any of them. As such, it should be appreciated that embodiments of the technology described herein are not limited to addressing all or any of the above-discussed issues of conventional techniques for generating data lineage information.

In some embodiments, a data processing system may be configured to: (1) obtain a derived data lineage representing relationships among physical data elements; (2) obtain a user-specified data lineage representing relationships among business data elements; (3) obtain an association between the derived data lineage and the user-specified data lineage (e.g., by generating an association between at least some of the physical data elements of the derived data lineage and at least some of the business data elements of the user-specified data lineage); and (4) generate, based on the association between the physical data elements and the business data elements, an indication of agreement or discrepancy between the derived data lineage and the user-specified data lineage.

In some embodiments, generating the indication of agreement or discrepancy comprises: displaying a visualization of the user-specified data lineage showing the indication of agreement or discrepancy. Non-limiting examples of such visualizations are provided herein in FIGS. 6A-6E and 8A-8F. For example, the user-specified data lineage may include a first link representing a first dependency between two business data elements, and displaying the visualization of the user-specified data lineage may comprise displaying the link in one manner (e.g., using a thick line as shown in FIG. 8A) when there is a dependency in the derived data lineage corresponding to the first dependency and in another manner (e.g., using a thin line as shown in FIG. 8A) when there is not a dependency in the derived data lineage corresponding to the first dependency.
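The thick-line/thin-line distinction can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation described in this document: the function name `link_styles`, the representation of lineages as lists of (upstream, downstream) pairs, the element names, and the choice to treat any downstream path in the derived lineage as a "corresponding dependency" are all hypothetical.

```python
from collections import defaultdict


def link_styles(stated_links, derived_links, association):
    """Label each user-specified link 'thick' (corroborated) or 'thin'.

    stated_links:  (upstream_business, downstream_business) pairs.
    derived_links: (upstream_physical, downstream_physical) pairs.
    association:   dict mapping business element -> set of physical elements.
    All names and structures here are hypothetical, for illustration only.
    """
    # Adjacency list of the derived lineage, upstream -> downstream.
    downstream = defaultdict(set)
    for src, dst in derived_links:
        downstream[src].add(dst)

    def reaches(start, targets):
        # Depth-first search: is any target element downstream of `start`?
        stack, seen = [start], {start}
        while stack:
            node = stack.pop()
            for nxt in downstream[node]:
                if nxt in targets:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    styles = {}
    for up, down in stated_links:
        targets = association.get(down, set())
        corroborated = any(reaches(p, targets) for p in association.get(up, set()))
        styles[(up, down)] = "thick" if corroborated else "thin"
    return styles


stated = [("Credit History", "Credit Score"), ("Annual Income", "Credit Score")]
derived = [("hist.pmt_hist", "tmp.score_in"), ("tmp.score_in", "rpt.cstCrdScr")]
association = {
    "Credit History": {"hist.pmt_hist"},
    "Credit Score": {"rpt.cstCrdScr"},
    "Annual Income": {"cust.income"},
}
styles = link_styles(stated, derived, association)
```

In this made-up example, the link from "Credit History" to "Credit Score" is backed by a two-step path in the derived lineage and is labeled thick, while the link from "Annual Income" has no derived counterpart and is labeled thin.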

In some embodiments, generating the indication of agreement or discrepancy comprises determining, based on the association between the derived data lineage and the user-specified data lineage, whether there is any discrepancy among the derived data lineage, the user-specified data lineage, and the association between the derived and user-specified data lineages.

In some embodiments, a physical data element may be any data element stored and/or processed by a data processing system. For example, a physical data element may be a field in a data record, and the value of the physical data element may be the value stored in the field of the data record. As another example, a physical data element may be a cell in a table (e.g., a cell occurring at a particular row and column of the table) and the value of the physical data element may be the value in the cell of the table. As yet another example, a physical data element may be a variable (e.g., in a report) and the value of the physical data element may be the value of the variable (e.g., in a particular instance of the report).

In some embodiments, a business data element may be any data element representing a conceptual quantity having relevance to a business. A business data element may be referred to (e.g., named and/or identified) by using natural language familiar to a business user (e.g., a business term). There may be one or multiple physical data elements that correspond to the business data element in that they store one or multiple values that are instances of the conceptual quantity, which the business data element represents. One example of a business data element may be a bank customer's credit score, which is a conceptual quantity relevant to a bank's business. There may be one or more physical data elements (e.g., in one or more tables, files, spreadsheets, data streams, etc.) storing values representing the bank customer's credit score. In this example, there may be multiple physical data elements corresponding to the business data element because the customer's credit score may be stored in multiple locations or because there are multiple different credit scores for the customer (e.g., different credit scores provided by different credit rating agencies). Thus, there may be one or multiple physical data elements corresponding to a single business data element. On the other hand, in some embodiments, there may be only a single business data element corresponding to a particular physical data element. A business data element may take on a value of a corresponding physical data element.

It should be appreciated that although there may be one or more physical data elements corresponding to a business data element, a conventional data processing system may not have access to information indicating such a correspondence. Without access to such information, a data processing system may not be able to automatically identify which physical data element(s) correspond to a business data element and/or which business data element corresponds to one or more physical data element(s). By contrast, some embodiments of the technology described herein provide for generating and storing an association between physical and business data elements. The generated association between a physical data element and a business data element may constitute information indicating the correspondence between the physical and business data elements. In some embodiments, a data processing system may use such associations to determine, automatically, which physical data elements and business data elements correspond to one another.

In some embodiments, a derived data lineage may include information about the lineage of one or more physical data elements stored and/or processed by a data processing system. Information about the lineage of a physical data element may include upstream lineage information indicating how the value of the physical data element was obtained. For example, the upstream lineage information may identify data (e.g., one or more other physical data elements) from which the physical data element was obtained and/or one or more transformations that have been applied to the data. Information about the lineage of the physical data element may, additionally or alternatively, include downstream lineage information indicating one or more other datasets, physical data elements, processes, and/or applications that depend on the value of the physical data element.

In some embodiments, a derived data lineage may be obtained by analyzing the source code of at least one computer program configured to access (e.g., read, write, and modify) at least some of the plurality of physical data elements managed by a data processing system. The source code analysis may be performed by using any suitable static code analysis techniques and/or any other suitable technique(s). The source code analysis may be used to identify one or more physical data elements input and/or accessed by the computer program, identify one or more transformations applied to the inputs and/or computations performed using the inputs as part of the computer program, and/or identify one or more outputs of the computer program. In some embodiments, the computer program may comprise a dataflow graph.

In some embodiments, in addition to or instead of analyzing the source code of one or more computer programs, a derived data lineage may be obtained by analyzing information obtained during runtime of the at least one computer program. For example, in some embodiments, one or more logs generated during runtime of a computer program may be analyzed to identify inputs to the computer program, one or more transformations applied to the inputs and/or computations performed using the inputs as part of the computer program, and/or one or more outputs of the computer program.
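As a rough illustration of log-based derivation, the following Python sketch extracts input-to-output edges from runtime log lines. The log format (`<program> READ|WRITE <entity>`), the function name `lineage_from_log`, and the program and entity names are assumptions made up for this example; this document does not specify a log format.

```python
import re

# Hypothetical log format, assumed for illustration: "<program> READ|WRITE <entity>"
LOG_LINE = re.compile(r"^(?P<program>\S+) (?P<op>READ|WRITE) (?P<entity>\S+)$")


def lineage_from_log(lines):
    """Return (input_entity, program, output_entity) edges found in the log."""
    reads, writes = {}, {}
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m is None:
            continue  # ignore lines that are not read/write events
        bucket = reads if m["op"] == "READ" else writes
        bucket.setdefault(m["program"], set()).add(m["entity"])

    # Every entity a program read is treated as upstream of everything it wrote.
    edges = set()
    for program, outputs in writes.items():
        for src in reads.get(program, set()):
            for dst in outputs:
                edges.add((src, program, dst))
    return edges


log = [
    "score_job READ hist.pmt_hist",
    "score_job READ cust.income",
    "score_job started at 02:00",  # not a read/write event, skipped
    "score_job WRITE rpt.cstCrdScr",
]
edges = lineage_from_log(log)
```

Treating every read as an input to every write of the same program is a coarse approximation; a finer-grained analysis would also need to know which transformation connected which input to which output.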

In some embodiments, a user-specified data lineage may be specified by a user and may represent relationships among business data elements. The user-specified lineage may include upstream and downstream lineage information. For example, the user-specified lineage may include information indicating one or more other business data elements used to generate (e.g., calculate) a business data element of interest to the business (e.g., a credit score of a bank customer). In some embodiments, one or more graphical user interfaces may be provided to the user to facilitate his/her specifying a user-specified data lineage.

In some embodiments, obtaining an association between a derived data lineage and a user-specified data lineage may be performed by generating an association between one or more physical data elements in the derived data lineage and one or more corresponding business data elements in the user-specified data lineage. In some embodiments, an association between a physical data element and a business data element may be generated automatically, for example, based on metadata (e.g., names) of the physical and business data elements. In some embodiments, an association between a physical data element and a business data element may be generated based on user input specifying the association. In such embodiments, one or more graphical user interfaces may be provided to the user to facilitate his/her specifying the association.
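One simple way to generate an association automatically from names is to normalize both kinds of names and match them exactly. The Python sketch below is a minimal illustration under stated assumptions: the abbreviation table, the function names `normalize` and `associate_by_name`, and the example element names other than "order_amt" and "Order Amount" are hypothetical, and a practical matcher would likely need fuzzier comparison.

```python
import re

# Hypothetical abbreviation table; a real system might draw this from a data dictionary.
ABBREVIATIONS = {"amt": "amount", "cust": "customer", "nm": "name"}


def normalize(name):
    """Lower-case a name, split it into tokens, and expand known abbreviations."""
    tokens = re.split(r"[^a-z0-9]+", name.lower())
    return tuple(ABBREVIATIONS.get(t, t) for t in tokens if t)


def associate_by_name(physical_names, business_names):
    """Map each business element to the physical elements whose names normalize equally."""
    by_key = {normalize(b): b for b in business_names}
    association = {}
    for p in physical_names:
        match = by_key.get(normalize(p))
        if match is not None:
            association.setdefault(match, set()).add(p)
    return association


association = associate_by_name(
    ["order_amt", "cust_nm", "x_tmp"],
    ["Order Amount", "Customer Name"],
)
```

Physical elements that match no business name (here, "x_tmp") are simply left unassociated, which leaves room for a user to specify those associations manually.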

In some embodiments, the association between a derived data lineage and a user-specified data lineage may be used to determine whether there is a discrepancy between these types of lineages. For example, when the association between the lineages indicates that business data element “B” is associated with physical data element “P”, determining whether there is a discrepancy may include determining that a first set of one or more sources of data identified in the derived data lineage as being used to obtain the physical data element P is different from a second set of one or more sources of data identified in the user-specified lineage as being used to obtain the business data element B.
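The comparison of source sets for B and P can be sketched as a set difference after translating physical sources into business terms. The following Python is illustrative only: the function name `source_discrepancy`, the dictionary-based lineage representation, and all element names are assumptions, not structures defined in this document.

```python
def source_discrepancy(business_element, stated_sources, derived_sources, association):
    """Compare stated sources of a business element with derived sources
    of its associated physical element(s).

    stated_sources:  business element -> set of upstream business elements.
    derived_sources: physical element -> set of upstream physical elements.
    association:     business element -> set of physical elements.
    All structures are hypothetical, chosen for illustration.
    """
    # Reverse the association so physical sources can be named in business terms.
    to_business = {p: b for b, ps in association.items() for p in ps}

    stated = set(stated_sources.get(business_element, ()))
    derived, unmapped = set(), set()
    for p in association.get(business_element, ()):
        for src in derived_sources.get(p, ()):
            if src in to_business:
                derived.add(to_business[src])
            else:
                unmapped.add(src)  # physical source with no business counterpart

    return {
        "only_stated": stated - derived,
        "only_derived": derived - stated,
        "unmapped_physical": unmapped,
    }


association = {
    "Credit Score": {"rpt.cstCrdScr"},
    "Credit History": {"hist.pmt_hist"},
    "Annual Income": {"cust.income"},
}
stated_sources = {"Credit Score": {"Credit History", "Annual Income"}}
derived_sources = {"rpt.cstCrdScr": {"hist.pmt_hist", "ext.bureau_feed"}}
report = source_discrepancy("Credit Score", stated_sources, derived_sources, association)
```

An empty result in all three sets would indicate agreement for that element; here the stated dependency on "Annual Income" is not found in the derived lineage, and the derived source "ext.bureau_feed" has no associated business data element.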

In some embodiments, the derived data lineage may be updated and the determination of whether there is a discrepancy between the derived data lineage and the user-specified data lineage may be repeated. In this way, discrepancies between the lineages that could arise because of changes to the data managed by the data processing system may be detected.

It should be appreciated that an association between a derived data lineage and a user-specified data lineage is not limited to being used for identifying discrepancies between the lineages and may be used for any other suitable purpose. For example, in some embodiments, the association between the lineages may be used to obtain a measure of data quality for one or more business data elements.

In some embodiments, quality of data in one or more physical data elements may be evaluated. For example, quality of the data may be evaluated using predefined data quality rules, which may define criteria for evaluating the values of physical data elements, such as by identifying characteristics (e.g., accuracy, precision, completeness, and validity) of the values according to the criteria. The extent to which the values exhibit these characteristics may thereby produce a measure of data quality for the physical data elements and, by virtue of the association between the physical and business data elements, a measure of data quality for the business data elements.

Accordingly, in some embodiments, a data processing system may be configured to: (1) obtain a derived data lineage representing relationships among physical data elements; (2) obtain a user-specified data lineage representing relationships among business data elements; (3) obtain an association between the derived data lineage and the user-specified data lineage, the association including an association between a first physical data element in the derived data lineage and a first business data element in the user-specified data lineage; and (4) determine, based on the association between the derived data lineage and the user-specified data lineage and a measure of data quality for the first physical data element, a measure of data quality for the first business data element.
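Step (4) can be pictured as carrying physical-level quality scores across the association. The Python sketch below is one possible reading, not a method stated in this document: the function name `business_quality`, the score scale of 0.0 to 1.0, the element names, and in particular the use of the minimum as the combining rule are all assumptions.

```python
def business_quality(association, physical_quality, combine=min):
    """Derive a quality measure for each business element from the measures
    of its associated physical elements.

    association:      business element -> set of physical elements.
    physical_quality: physical element -> score in [0.0, 1.0] (assumed scale).
    combine:          how to merge several scores; `min` is an assumption that
                      makes the business element only as good as its worst copy.
    """
    result = {}
    for business, physicals in association.items():
        scores = [physical_quality[p] for p in physicals if p in physical_quality]
        if scores:  # leave the business element unscored if nothing was measured
            result[business] = combine(scores)
    return result


association = {
    "Credit Score": {"rpt.cstCrdScr", "dw.credit_score"},
    "Annual Income": {"cust.income"},
}
physical_quality = {"rpt.cstCrdScr": 0.98, "dw.credit_score": 0.75}
quality = business_quality(association, physical_quality)
```

Passing a different `combine` function (for example, an average) changes the policy without changing how the association is used.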

It should be appreciated that the embodiments described herein may be implemented in any of numerous ways. Examples of specific implementations are provided below for illustrative purposes only. It should be appreciated that these embodiments and the features/capabilities provided may be used individually, all together, or in any combination of two or more, as aspects of the technology described herein are not limited in this respect.

FIG. 1 is a block diagram of an illustrative computing environment 100, in which some embodiments of the technology described herein may operate. Computing environment 100 includes data processing system 105, which is configured to operate on data stored in data store 104.

In some embodiments, data store 104 may include one or multiple storage devices storing data in one or more formats of any suitable type. For example, the storage device(s) that are part of data store 104 may store data using one or more database tables, spreadsheet files, flat text files, and/or files in any other suitable format (e.g., a native format of a mainframe). The storage device(s) may be of any suitable type and may include one or more servers, one or more database systems, one or more portable storage devices, one or more non-volatile storage devices, one or more volatile storage devices, and/or any other device(s) configured to store data electronically. In some embodiments, data store 104 may include one or more online data streams in addition to or instead of storage device(s). Accordingly, in some embodiments, data processing system 105 may have access to data provided over one or more data streams in any suitable format.

In embodiments where data store 104 includes multiple storage devices, the storage devices may be co-located in one physical location (e.g., in one building) or distributed across multiple physical locations (e.g., in multiple buildings, in different cities, states, or countries). The storage devices may be configured to communicate with one another using one or more networks such as, for example, network 106 shown in FIG. 1.

In some embodiments, the data stored by the storage device(s) may include one or multiple data entities such as one or more files, tables, data in rows and/or columns of tables, spreadsheets, datasets, data records (e.g., credit card transaction records, phone call records, and bank transaction records), fields, variables, messages, and/or reports. The storage device(s) may store thousands, millions, tens of millions, or hundreds of millions of data entities. Each data entity may include one or multiple physical data elements.

A physical data element may be any data element stored and/or processed by a data processing system. For example, a physical data element may be a field in a data record, and the value of the physical data element may be the value stored in the field of the data record. As a specific non-limiting example, a physical data element may be a field storing a caller's name in a data record storing information about a phone call (which data record may be part of multiple data records about phone calls made by customers of a telecommunications company) and the value of the physical data element may be the value stored in the field. As another example, a physical data element may be a cell in a table (e.g., a cell occurring at a particular row and column of the table) and the value of the physical data element may be the value in the cell of the table. As another example, a physical data element may be a variable (e.g., in a report) and the value of the physical data element may be the value of the variable (e.g., in a particular instance of the report). As a specific non-limiting example, a physical data element may be a variable in a report about a bank loan applicant representing the applicant's credit score, and the value of the physical data element may be the numeric value of the credit score (e.g., a numeric value between 300 and 850). The value of the physical data element representing the applicant's credit score may change depending on the data used to generate the report about the bank loan applicant.

In some embodiments, a physical data element may take on a value of any suitable type. For example, a physical data element may take on a numeric value, an alphabetic value, a value from a discrete set of options (e.g., a finite set of categories), or any other suitable type of value, as aspects of the technology described herein are not limited in this respect.

Data processing system 105 may include one or multiple computer programs 109 configured to operate on data in data store 104. The computer programs 109 may be of any suitable type and written in any suitable programming language(s). For example, in some embodiments, computer programs 109 may include one or more computer programs written at least in part using the structured query language (SQL) and configured to access data in one or more databases that are part of data store 104. As another example, in some embodiments, data processing system 105 is configured to execute programs in the form of graphs and computer programs 109 may comprise one or more computer programs developed as dataflow graphs. A dataflow graph may include components, termed “nodes” or “vertices,” representing data processing operations to be performed on input data and links between the components representing flows of data. Techniques for executing computations encoded by dataflow graphs are described in U.S. Pat. No. 5,966,072, titled “Executing Computations Expressed as Graphs,” which is incorporated by reference herein in its entirety.

In the illustrated embodiment of FIG. 1, data processing system 105 further includes development environment 108 that may be used by a person (e.g., a developer) to develop one or more of computer programs 109 for operating on data in data store 104. For example, in some embodiments, user 102 may use computing device 103 to interact with development environment 108 to specify a computer program, such as a dataflow graph, and save the computer program as part of computer programs 109. An environment for developing computer programs as dataflow graphs is described in U.S. Pat. Pub. No.: 2007/0011668, titled “Managing Parameters for Graph-Based Applications,” which is incorporated by reference herein in its entirety.

In some embodiments, one or more of computer programs 109 may be configured to perform any suitable operations on data in data store 104. For example, one or more of computer programs 109 may be configured to access data from one or more sources, transform the accessed data (e.g., by changing data values, filtering data records, changing data formats, sorting the data, combining data from multiple sources, splitting data into multiple portions, and/or in any other suitable way), calculate one or more new values from accessed data, and/or write the data to one or multiple destinations.

In some embodiments, one or more of computer programs 109 may be configured to perform computations on and/or generate reports from data in data store 104. The computations performed and/or reports generated may be related to one or more quantities relevant to a business. For example, a computer program may be configured to access credit history data for a person and determine a credit score for the person based on the credit history. As another example, a computer program may access telephone call logs of multiple customers of a telephone company and generate a report indicating how many of the customers use more data than allowed for in their data plans. As yet another example, a computer program may access data indicating the types of loans made by a bank and generate a report indicating the overall risk of loans made by the bank. These examples are illustrative and non-limiting, as a computer program may be configured to generate any suitable information (e.g., for any suitable business purpose) from data stored in data store 104.

In the illustrated embodiment, data processing system 105 also includes a data governance module 110 that supports the performance of various data governance tasks. For example, in the illustrated embodiment, data governance module 110 includes data dictionary module 112, role management module 114, data quality module 116, derived lineage module 118, user-specified lineage module 120, and lineage association module 122, each of which comprises processor-executable instructions that, when executed, perform functionality supporting the performance of one or more data governance tasks, as described in greater detail below.

In some embodiments, data dictionary module 112 may be configured to store information about data in data store 104. That is, data dictionary 112 may be configured to store metadata associated with data in data store 104. For example, data dictionary 112 may store one or more alternative names for physical data elements in data store 104. In this way, rather than referring to a physical data element by the name of the variable to which it corresponds (which variable name may have been created by a programmer and is not “user-friendly” in that it does not immediately convey to a user what information the variable represents), the data dictionary may include one or more alternative terms for the physical data element such as, for example, a natural language term or phrase that business people would use to refer to the physical data element. As a specific example, the data dictionary 112 may store the name “Bank Customer Credit Score” or “Bank Customer FICO Credit Score” as an alternative name for a physical data element corresponding to a variable named “cstCrdScr,” which stores the value of a FICO credit score for a particular bank customer. As another specific example, the data dictionary 112 may store the name “Order Amount” as language that may be used for referring to the physical data element corresponding to a field named “order_amt.”

In some embodiments, role management module 114 may manage information indicating which party or parties are responsible for various data elements stored in data store 104. Managing such role information may include storing the role information, allowing one or more users to modify such information (e.g., by removing, adding, or changing parties and/or their responsibilities), and/or displaying the role information.

In some embodiments, the role management module 114 may specify responsible parties for one or more physical data elements and/or one or more business data elements. For example, role management module 114 may be configured to manage information used for generating (and, in some embodiments, may be configured to generate) a graphical interface indicating parties accountable for management of a data element. An illustrative example of such a graphical interface is shown in FIG. 6A, which identifies four individuals (including a business owner 602, data steward 604, and two subject matter experts 606 and 608) accountable for management of the “credit score” business data element 601.

In some embodiments, data quality module 116 may be configured to determine one or more measures of data quality for each of one or more physical data elements. The quality of data in physical data elements may be determined in any suitable way. For example, in some embodiments, the quality of the data may be evaluated using predefined data quality rules, which may define criteria for evaluating the values of physical data elements, such as by identifying characteristics (e.g., accuracy, precision, completeness, and validity) of the values according to the criteria. The extent to which the values exhibit these characteristics may thereby produce a measure of data quality for the physical data elements. Aspects of evaluating the quality of data using data quality rules are described in U.S. Pat. Pub. No.: 2014/0108357, “Specifying and Applying Rules to Data,” which is incorporated by reference herein in its entirety.
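A data quality rule can be pictured as a predicate applied to each value of a physical data element, with the pass rate serving as the measure. The Python below is a minimal sketch under stated assumptions: the function name `quality_measure`, the rule names, and the use of a pass fraction as the measure are hypothetical; only the 300 to 850 credit score range comes from this document.

```python
def quality_measure(values, rules):
    """Return, per rule, the fraction of values that satisfy it.

    values: list of values of one physical data element (None means missing).
    rules:  dict mapping a rule name to a predicate; both are hypothetical.
    """
    if not values:
        return {}
    return {
        name: sum(1 for v in values if rule(v)) / len(values)
        for name, rule in rules.items()
    }


# Example rules for a credit score field, using the 300-850 range mentioned above.
rules = {
    "completeness": lambda v: v is not None,
    "validity": lambda v: v is not None and 300 <= v <= 850,
}
scores = [720, None, 910, 640]
measure = quality_measure(scores, rules)
# completeness: 3 of 4 values present; validity: 2 of 4 values in range
```

In this made-up sample, one value is missing and one is out of range, giving a completeness of 0.75 and a validity of 0.5.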

In some embodiments, derived lineage module 118 may be configured to generate a derived data lineage for at least some of the data in data store 104. A derived data lineage may include information about the lineage of one or more physical data elements. For example, a derived data lineage may include upstream lineage information indicating how the value of the physical data element was obtained and/or downstream lineage information indicating one or more other datasets, physical data elements, processes, and/or applications that depend on the value of the physical data element.

In some embodiments, derived lineage module 118 may be configured to generate a derived data lineage by analyzing the source code of at least one computer program configured to access (e.g., read, write, and modify) at least some of the plurality of physical data elements managed by a data processing system. The source code analysis may be used to identify inputs to a computer program (e.g., identify one or more physical data elements accessed by the computer program), identify one or more transformations applied to the inputs and/or computations performed using the inputs as part of the computer program, and/or identify one or more outputs of the computer program. In some embodiments, the computer program may comprise a dataflow graph.

In some embodiments, derived lineage module 118 may be configured to generate a derived data lineage by analyzing information obtained during runtime of the at least one computer program. For example, in some embodiments, one or more logs generated during runtime of a computer program may be analyzed to identify inputs to the computer program, one or more transformations applied to the inputs and/or computations performed using the inputs as part of the computer program, and/or one or more outputs of the computer program.

In some embodiments, derived lineage module 118 may be configured to generate a derived data lineage by using one or more data discovery processes. For example, in some embodiments, a computer program implementing a data discovery process may be configured to identify different physical data elements containing the same data values and, based on that identification, determine that these physical data elements are related. For example, the computer program may be configured to determine that a same table of data is stored in multiple different databases and, on that basis, determine that the physical data elements in these tables are related. It should be appreciated that the derived lineage module 118 may be configured to generate a derived lineage using any of the above-described ways or any combination of two or more of the above-described or other ways, as aspects of the technology described herein are not limited in this respect.
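A very small version of such a data discovery step is to group physical data elements by the set of values they contain. The Python sketch below illustrates that idea only; the function name `related_by_content`, the column names, and the decision to compare exact value sets (ignoring order and duplicates) are assumptions, and a practical process would likely tolerate partial overlap and sample large columns.

```python
from collections import defaultdict


def related_by_content(columns):
    """Group physical data elements that hold exactly the same set of values.

    columns: dict mapping a (hypothetical) element name to its list of values.
    Returns a list of groups, each a sorted list of related element names.
    """
    groups = defaultdict(list)
    for name, values in columns.items():
        # Fingerprint ignores ordering and duplicates; values must be hashable.
        groups[frozenset(values)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


columns = {
    "db1.customers.id": [101, 102, 103],
    "db2.cust_copy.cust_id": [103, 101, 102],
    "db1.orders.order_amt": [25.0, 40.0, 12.5],
}
related = related_by_content(columns)
```

Here the two customer identifier columns hold the same values in different databases and are reported as related, while the order amount column stands alone and is omitted.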

FIG. 2 is a data lineage diagram 200 of an illustrative derived data lineage. The derived data lineage and the diagram illustrating it may be generated by derived lineage module 118. Data lineage diagram 200 includes nodes 202 representing data entities and nodes 204 representing transformations applied to the data entities. The data lineage diagram 200 illustrates upstream lineage information for one or more physical data elements in data entity 206. Arrows coming into a node representing a transformation indicate which data entities are provided as inputs to the transformation. Arrows coming out of nodes representing transformations of data indicate data entities into which results of the transformations are provided. Examples of data entities are provided herein. Examples of transformations include, but are not limited to, performing calculations of any suitable type, sorting the data, filtering the data to remove one or more portions of data (e.g., filtering data records to remove one or more data records) based on any suitable criteria, merging data (e.g., using a join operation or in any other suitable way), performing any suitable database operation or command, and/or any suitable combination of the foregoing transformations. A transformation may be implemented using one or more computer programs of any suitable type including, by way of example and not limitation, one or more computer programs implemented as dataflow graphs.
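A diagram of this kind can be modeled as a directed graph whose nodes are data entities and transformations, and upstream lineage can then be read off by walking the arrows backwards. The Python sketch below assumes a hypothetical dictionary `inputs_of` that maps each node to the nodes feeding it directly; the node names are made up and do not correspond to the reference numerals of FIG. 2.

```python
def upstream(node, inputs_of):
    """Return every data entity and transformation that `node` depends on.

    inputs_of: dict mapping a node to the nodes with arrows pointing into it.
    """
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent in inputs_of.get(current, ()):
            if parent not in seen:  # guard against revisiting shared ancestors
                seen.add(parent)
                stack.append(parent)
    return seen


# Data entities: txns, sorted_txns, customers, report. Transformations: sort, join.
inputs_of = {
    "report": ["join"],
    "join": ["customers", "sorted_txns"],
    "sorted_txns": ["sort"],
    "sort": ["txns"],
}
lineage = upstream("report", inputs_of)
```

Reversing the dictionary (mapping each node to the nodes it feeds) and running the same traversal would yield downstream lineage instead.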

A data lineage diagram, such as diagram 200 shown in FIG. 2, may be useful for a number of reasons. For example, illustrating relationships between data entities and transformations may help a user to determine how a particular physical data element was obtained (e.g., how a particular value in a report was computed). As another example, a data lineage diagram may be used to determine which transformations were applied to various physical data elements and/or data entities.

In some embodiments, a derived data lineage may represent relationships among physical data elements, data entities containing those physical data elements, and/or transformations applied to the physical data elements. The relationships among physical data elements, data entities, and transformations may be used to determine relationships among other things such as, for example, systems (e.g., one or more computing devices, databases, data warehouses, etc.) and/or applications (e.g., one or more computer programs that access data managed by a data processing system). For example, when a physical data element that is part of a table in a database stored in system “A” located in one physical location is indicated, within a derived data lineage, to be derived from another physical data element that is part of another table in another database stored in system “B,” then a relationship between systems A and B may be inferred. As another example, when an application program reads one or more physical data elements from a system, a relationship between the application program and the system may be inferred. As yet another example, when one application program accesses physical data elements operated on by another application program, a relationship between the application programs may be inferred. Any one or more of these relationships may be shown as part of a data lineage diagram.
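As a non-limiting sketch of the inference described above, element-level lineage edges may be lifted to system-level relationships by looking up the system that stores each element. The element names, the `system_of` mapping, and the function name are hypothetical and are used here only for illustration.

```python
def infer_system_links(element_edges, system_of):
    """Derive system-to-system relationships from element-level lineage.

    `element_edges` holds (source_element, derived_element) pairs and
    `system_of` maps each element to the system that stores it.
    """
    links = set()
    for source, target in element_edges:
        src_sys, dst_sys = system_of[source], system_of[target]
        # Edges within a single system do not relate two systems.
        if src_sys != dst_sys:
            links.add((src_sys, dst_sys))
    return links


system_of = {
    "b.table2.score": "B",
    "a.table1.score": "A",
    "a.table1.band": "A",
}
element_edges = [
    ("b.table2.score", "a.table1.score"),  # element in A derived from element in B
    ("a.table1.score", "a.table1.band"),   # derivation internal to system A
]
system_links = infer_system_links(element_edges, system_of)
```

Here the only inferred relationship is from system B to system A, mirroring the example in which an element stored in system A is derived from an element stored in system B.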

It should be appreciated that a data processing system may manage a large number of physical data elements (e.g., millions, billions or trillions of physical data elements).¹ Accordingly, derived data lineage may represent relationships among a large number of physical data elements, data entities containing those physical data elements, and/or transformations applied to the physical data elements. Because a derived data lineage may include a large amount of information, it is important to present that information in a manner that is digestible by the viewer. Accordingly, in some embodiments, information in a derived data lineage may be visualized at different levels of granularity. Various techniques for visualizing information in derived lineages and some aspects of techniques for generating and/or visualizing derived data lineages are described in: (1) U.S. Pat. App. Pub. No. 2010/0138431, titled “Visualizing Relationships Between Data Elements and Graphical Representations of Data Element Attributes”; (2) U.S. Pat. App. Pub. No. 2016/0232230, titled “Filtering Data Lineage Diagrams”; (3) U.S. Pat. App. Pub. No. 2016/0028580, titled “Data Lineage Summarization”; and (4) U.S. Pat. App. Pub. No. 2016/0019286, titled “Managing Lineage Information,” each of which is incorporated by reference in its entirety.

¹ For example, a data processing system managing data associated with credit card transactions may process billions of credit card transactions a year and each of the transactions may include multiple physical data elements such as, for example, credit card number, date, merchant id, and purchase amount.

In some embodiments, user-specified lineage module 120 may be configured to facilitate the specification of a user-specified lineage by a user (e.g., user 102 or any other suitable user). The user-specified lineage module 120 may be configured to provide one or more graphical user interfaces to the user to facilitate his/her manually specifying a lineage. The graphical user interface(s) may provide a canvas wherein a user can drag and drop graphical display elements corresponding to business data elements. The graphical display elements may be connected using links (e.g., lines, directional arrows, etc.) to indicate lineage relationships among the business data elements represented by the graphical display elements.

In some embodiments, a user-specified data lineage may be specified by a user and may represent relationships among business data elements. The user-specified lineage may include upstream and downstream lineage information. For example, the user-specified lineage may include information indicating one or more other business data elements used to generate (e.g., calculate) a business data element of interest to the business (e.g., a credit score of a bank customer).

In some embodiments, association module 122 may be configured to facilitate the generation of an association between a derived data lineage and a user-specified data lineage. To this end, association module 122 may generate, for each of one or more business data elements, an association between a business data element and one or more corresponding physical data elements.

In some embodiments, the association module 122 may generate an association between a business data element and one or more corresponding physical data elements automatically (e.g., without user input indicating that the business data element and the physical data elements should be associated). This may be done in any suitable way. For example, in some embodiments, an association between a physical data element and a business data element may be generated automatically based on metadata of the physical and business data elements. Such metadata may contain information including, but not limited to, names of the physical and/or business data elements, types of the physical and business data elements, relationships between the physical data element and one or more other physical data elements, and relationships between the business data element and one or more other business data elements. As one specific example, when the physical and business data elements share at least a threshold number of attributes, the association module 122 may associate these elements. As another example, existing associations among data elements may inform the automatic identification of new associations. For example, if a physical data element A (a field in table I storing a credit score for a bank customer) is associated with business data element B (credit score for the bank customer), and the data processing system determines (e.g., using a data discovery process) that physical data element A is related to physical data element C (a field in table II storing a copy of the credit score for the bank customer), then association module 122 may associate physical data element C with business data element B.
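The two automatic approaches described above, namely associating elements that share at least a threshold number of attributes and extending existing associations to related physical data elements, may be sketched as follows. This is a minimal illustration under stated assumptions: metadata is modeled as flat dictionaries of string attributes, an attribute is "shared" when both key and value match, and the function names, attribute names, and threshold value are hypothetical.

```python
def shared_attribute_count(physical_meta, business_meta):
    """Count attributes (key and value) common to both metadata records."""
    return len(set(physical_meta.items()) & set(business_meta.items()))


def auto_associate(physical, business, threshold=2):
    """Associate elements sharing at least `threshold` attributes."""
    associations = set()
    for p_id, p_meta in physical.items():
        for b_id, b_meta in business.items():
            if shared_attribute_count(p_meta, b_meta) >= threshold:
                associations.add((p_id, b_id))
    return associations


def propagate(associations, related_pairs):
    """Extend associations to physical elements related to associated ones."""
    result = set(associations)
    for known, other in related_pairs:
        for p_id, b_id in associations:
            if p_id == known:
                result.add((other, b_id))
    return result


physical = {
    "A": {"name": "credit_score", "type": "integer", "table": "I"},
    "C": {"name": "cs_copy", "type": "integer", "table": "II"},
}
business = {"B": {"name": "credit_score", "type": "integer"}}

direct = auto_associate(physical, business)   # A shares name and type with B
extended = propagate(direct, [("A", "C")])    # C is a discovered copy of A
```

In this sketch, physical data element C shares only one attribute with business data element B and so is not associated directly, but it becomes associated with B once it is determined to be related to physical data element A.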

In some embodiments, association module 122 may generate an association between a physical data element and a business data element based at least in part (or in whole) on user input specifying the association. In such embodiments, one or more graphical user interfaces may be provided to allow a user to specify the association between the physical and business data elements. Illustrative examples of such user interfaces are shown in FIGS. 4A and 4B.

FIG. 4A is a diagram illustrating a graphical interface 400 through which a business data element 401 (“Order Amount”) may be associated with two corresponding physical data elements: the physical data element 402 named “order_amt” in dataset “rush_order” and the physical data element 403 also named “order_amt” in dataset “order_fact.” The graphical user interface 400 may be used to remove one or both of these associations and/or add one or more other associations. As may be appreciated from the graphical user interface 400, business data element 401 may be associated with one or multiple corresponding physical data elements.

FIG. 4B is a diagram illustrating another graphical interface 410 through which a physical data element may be associated with a business data element, in accordance with some embodiments of the technology described herein. As shown in FIG. 4B, physical data element 402 in dataset “rush_order” may be associated with business data element 401. As may be appreciated from the graphical user interface 410, physical data element 402 may be associated with a single corresponding business data element.

In some embodiments, data processing system 100 may be configured to show information about data managed by the data processing system to one or more users. In the embodiment illustrated in FIG. 1, data processing system 100 may be configured to show information about data managed by the system to user 130 via computing device 134. The user 130 may view any suitable information via computing device 134 including, for example, lineage information associated with data managed by system 100. Accordingly, user 130 may view information about a derived data lineage for a physical data element generated by using derived lineage module 118 (e.g., via any suitable type of data lineage diagram, examples of which are provided herein), information about a user-specified data lineage generated at least in part by using user-specified lineage module 120, and information indicating the association between the derived data lineage and the user-specified data lineage (e.g., as described below with reference to FIGS. 3A-3D).

Each of computing devices 103 and 134 may be any suitable type of computing device, fixed or portable, as aspects of the technology described herein are not limited in this respect. In addition, computing devices 103 and 134 need not be the same type of computing device. Computing devices 103 and 134, data processing system 105 and data store 104 are configured to communicate with one another via network 106. Network 106 may be any suitable type of network such as the Internet, an intranet, a wide area network, a local area network, and/or any other suitable type of network.

As described above, in some embodiments, the association between a derived data lineage and a user-specified data lineage may be used to determine whether there is a discrepancy between these types of lineages. For example, as shown in FIGS. 3A and 3B, the association between a derived data lineage and a user-specified data lineage may be used to determine that the derived and user-specified data lineages indicate different data sources for associated physical and business data elements.

FIG. 3A is a diagram illustrating an association between an example user-specified lineage 300 and an example derived data lineage 320, in accordance with some embodiments of the technology described herein. Each of user-specified lineage 300 and derived data lineage 320 may be obtained in any of the ways described herein. It should be appreciated that user-specified and derived data lineages may be more complex than the lineages shown in FIG. 3A and, for example, may include many more business data elements, physical data elements, data entities, business data containers, and the like. The examples of lineages shown in FIG. 3A are being used for ease of exposition and not by way of limitation.

Derived data lineage 320 includes data entities 340, 342, 344, 346, 348, and 350. Each of the data entities may be stored in different systems and/or computing devices. Alternatively, two or more (or all) of the data entities may be stored in one system and/or computing device. Examples of data entities are provided herein. Each data entity may include one or multiple physical data elements. Data entity 340 contains one or more physical data elements including physical data element 322. Data entity 342 contains one or more physical data elements including physical data element 324. Data entity 344 contains multiple physical data elements including physical data elements 326, 328, and 330. Data entity 346 includes one or more physical data elements including physical data element 332. Data entity 348 includes one or more physical data elements including physical data element 334. Data entity 350 includes one or more physical data elements including physical data element 336.

In some embodiments, a derived data lineage may include upstream data lineage information for one or more physical data elements, which provides information about how the physical data element(s) were obtained and/or generated. For example, in the illustrative example of FIG. 3A, derived data lineage 320 includes upstream data lineage information for physical data element 322. As indicated by the shading shown in FIG. 3A, physical data element 322 was obtained from physical data element 324, which was obtained from multiple physical data elements including physical data element 326, which was obtained from physical data element 332. Accordingly, physical data element 322 was obtained based, at least in part, on physical data element 332 in data entity 346.

User-specified data lineage 300 includes data containers 303, 305, 307, and 309. A data container may be any suitable container for encapsulating a business data element. The data container may be used to present the business data element to a business user. For example, a data container may be a report, a spreadsheet, a presentation having one or more slides, a text file, a Word document, and/or a PDF file. In some embodiments, the content in the data container may be generated by a user, for example, by performing a database query (e.g., a SQL query) and placing the results of the database query into the data container. As a specific non-limiting example, a user creating a user-specified data lineage may perform a database query and insert a table returned as a result of the query into a spreadsheet file.

As shown in FIG. 3A, data container 303 includes one or more business data elements including business data element 302. Data container 305 includes one or more business data elements including business data element 304. Data container 307 includes one or more business data elements including business data element 306. Data container 309 includes one or more business data elements including business data element 308.

In some embodiments, a user-specified data lineage may include upstream data lineage information, which provides information about how the business data element(s) were obtained and/or generated, and/or downstream lineage information for one or more business data elements, which provides information indicating which other business data element(s) depend on the business data element(s). For example, in the illustrative example of FIG. 3A, user-specified lineage 300 includes upstream data lineage information for business data element 302. As shown in FIG. 3A, the user-specified lineage 300 indicates that business data element 302 was obtained from business data element 304, which was obtained from business data element 306, which was obtained from business data element 308.

As discussed herein, in some embodiments, an association may be generated between a user-specified lineage and a derived data lineage by generating an association between one or more physical data elements in the derived data lineage and one or more corresponding business data elements in the user-specified data lineage. An illustrative example of such an association is shown in FIG. 3A, which shows that: (1) business data element 302 is associated with physical data element 322 via association link 352; (2) business data element 304 is associated with physical data element 324 via association link 354; (3) business data element 306 is associated with physical data element 326 via association link 356; and (4) business data element 308 is associated with physical data element 332 via association link 358. As may be appreciated from the example of FIG. 3A, an association between a user-specified data lineage and a derived data lineage may comprise information specifying one or more association links between data elements in the lineages. FIG. 3B shows a simplified version of FIG. 3A, with data entities 340, 342, 344, 346, 348, and 350 and data containers 303, 305, 307, and 309 omitted.

In some embodiments, the association between a derived data lineage and a user-specified data lineage may be used to determine whether there is a discrepancy between the lineages. For example, the association shown in FIG. 3A indicates that there is no discrepancy between the user-specified lineage for the business data element 302 and derived data lineage for the physical data element 322, which is associated with the business data element 302. In this example, every physical data element in the upstream derived data lineage of physical data element 322 is associated with a corresponding business data element in the upstream user-specified data lineage for the business data element 302. For example, physical data element 332, which is used to obtain physical data element 322, according to the derived data lineage 320, is associated with business data element 308, which is used to obtain business data element 302, according to the user-specified data lineage 300.

By contrast, the association shown in FIG. 3C indicates that there is a discrepancy between the user-specified lineage 300 and the derived data lineage 320, which has been updated to reflect changes to the data managed by the underlying data processing system. As a result of the changes to the derived data lineage 320, the physical data element 322 is now obtained by using physical data element 336, as indicated by the shading in FIG. 3C, rather than physical data element 332, as shown in FIG. 3B. As a result, not every physical data element in the upstream derived data lineage of physical data element 322 is associated with a corresponding business data element in the upstream user-specified data lineage for the business data element 302. As shown in FIG. 3C, physical data element 336, which is used to obtain physical data element 322, is not associated with a business data element, in the user-specified data lineage 300, used to obtain business data element 302, which is the business data element associated with physical data element 322. Moreover, although physical data element 332 is not used to generate physical data element 322 according to the derived data lineage, it is nonetheless associated with business data element 308, which is used to generate business data element 302 according to the user-specified data lineage. These discrepancies may be identified automatically using the technology described herein and a user may be alerted to their presence and/or one or more automated actions to resolve the discrepancies may be taken (e.g., by changing the user-specified data lineage and/or notifying one or more users to implement such a change).
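The comparison of FIGS. 3A and 3C may be illustrated by the following sketch, which walks the upstream lineage of an associated pair of elements and reports upstream elements lacking a counterpart on the other side. The dictionaries use the reference numerals of the figures as identifiers; the function names and data layout are hypothetical, and the assumption that physical data element 336 feeds physical data element 326 in FIG. 3C is made only so that the example is concrete.

```python
def upstream(element, parents):
    """Return every element reachable by following `parents` upstream."""
    seen, stack = set(), [element]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def find_discrepancies(physical, business, derived_parents, stated_parents, links):
    """Compare upstream lineages; `links` maps physical -> business element."""
    derived_up = upstream(physical, derived_parents)
    stated_up = upstream(business, stated_parents)
    # Upstream physical elements with no associated upstream business element.
    unmatched_physical = {p for p in derived_up if links.get(p) not in stated_up}
    # Upstream business elements not backed by any upstream physical element.
    mapped = {links[p] for p in derived_up if p in links}
    unmatched_business = stated_up - mapped
    return unmatched_physical, unmatched_business


stated = {302: [304], 304: [306], 306: [308]}
links = {322: 302, 324: 304, 326: 306, 332: 308}

derived_3a = {322: [324], 324: [326], 326: [332]}
derived_3c = {322: [324], 324: [326], 326: [336]}  # assumed path for FIG. 3C

result_3a = find_discrepancies(322, 302, derived_3a, stated, links)
result_3c = find_discrepancies(322, 302, derived_3c, stated, links)
```

For the lineage of FIG. 3A both returned sets are empty. For the updated lineage of FIG. 3C, the sketch reports physical data element 336 as lacking an associated business data element and business data element 308 as no longer backed by an upstream physical data element.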

As illustrated in FIGS. 3A, 3B, and 3C, in some embodiments, an association between a user-specified data lineage and a derived data lineage includes an association between business data elements in the user-specified data lineage and physical data elements in the derived data lineage. In some embodiments, the association between a user-specified data lineage and a derived data lineage may further include an association between transformations in the user-specified data lineage and the derived data lineage. A transformation in a user-specified data lineage may indicate how a business data element is obtained from one or more other business data elements. A transformation in a derived data lineage may indicate how a physical data element is obtained from one or more other physical data elements. Examples of transformations are provided herein.

An example of an association between transformations in user-specified and derived data lineages is shown in the example illustrated in FIG. 3D. In FIG. 3D, user-specified data lineage 300 further includes transformation 310, which is applied to business data elements 308 and 309 to obtain business data element 306. Derived data lineage 320 further includes transformation 323, which is applied to physical data elements 332 and 334 to obtain physical data element 326. As shown in FIG. 3D, the transformations 310 and 323 are associated with one another via association link 357. Although only one transformation is shown in FIG. 3D for each of user-specified data lineage 300 and derived data lineage 320, it should be appreciated that each lineage may include any suitable number of transformations, as aspects of the technology described herein are not limited in this respect. For example, a derived data lineage may include a transformation between linked pairs of data entities and/or physical data entities (see e.g., transformations 204 shown in FIG. 2).

FIG. 5 is a flowchart of an illustrative process 500 for obtaining (e.g., generating or accessing) an association between a user-specified and a derived lineage and using the obtained association to determine whether there are any discrepancies among the user-specified lineage, the derived lineage, and the association between them, in accordance with some embodiments of the technology described herein. Process 500 may be performed by any suitable system and/or computing device(s) and, for example, may be performed by data processing system 105 described with reference to FIG. 1.

Process 500 begins at act 502, where a user-specified data lineage is obtained. The user-specified data lineage may be obtained in any suitable way. For example, the user-specified data lineage may be specified by a user using one or more graphical user interfaces provided by the data processing system to the user in order to facilitate his/her specifying a user-specified data lineage.

Next, process 500 proceeds to act 504, where a derived data lineage is obtained. The derived data lineage may be obtained in any of the ways described herein. For example, in some embodiments, the derived data lineage may be obtained by analyzing the source code of one or more computer program(s) configured to access at least some of the plurality of physical data elements managed by the data processing system. The source code analysis may be used to identify one or more physical data elements input or accessed by the computer program(s), identify one or more transformations applied to the inputs and/or computations performed using the inputs as part of the computer program(s), and/or identify one or more outputs of the computer program(s). Additionally or alternatively, a derived data lineage may be obtained by analyzing information obtained during runtime of the computer program(s). For example, in some embodiments, one or more logs generated during runtime of a computer program may be analyzed to identify inputs to the computer program, one or more transformations applied to the inputs and/or computations performed using the inputs as part of the computer program, and/or one or more outputs of the computer program.

Next, process 500 proceeds to act 506, where an association between the user-specified lineage obtained at act 502 and the derived data lineage obtained at act 504 is obtained. The association may be obtained by accessing a previously-generated association or by generating the association as part of process 500. Generating an association between a derived data lineage and a user-specified data lineage may comprise generating an association between one or more physical data elements in the derived data lineage and one or more corresponding business data elements in the user-specified data lineage. Additionally, generating an association between a derived data lineage and a user-specified data lineage may comprise generating an association between one or more transformations of physical data elements in the derived data lineage and one or more corresponding transformations of business data elements in the user-specified data lineage. Once generated, the association may be stored in one or multiple data structures by the data processing system so that it is available for subsequent use.

An association between user-specified and derived data lineages may be generated in any of the ways described herein. In some embodiments, an association between the lineages may be generated automatically, for example, based on metadata (e.g., names) of the physical and business data elements. In some embodiments, an association between the lineages may be generated based on user input specifying the association. In such embodiments, one or more graphical user interfaces may be provided by the data processing system to the user to facilitate his/her specifying the association. The graphical user interfaces may facilitate specifying associations between physical and business data elements as well as between transformations being applied to such elements.

Next, process 500 proceeds to act 508, where a visualization is generated of the association obtained at act 506. The visualization may provide a graphical indication of which physical data elements and business data elements are associated with one another. Additionally, the visualization may also provide a graphical indication of which transformations in the user-specified lineage and which transformations in the derived data lineage are associated with one another. For example, in some embodiments, the generated visualization may include one or multiple graphical elements representing association links between one or more physical data elements in the derived data lineage and the associated business data element(s) (e.g., association links 352, 354, 356, and 358 in FIG. 3A). As a specific example, the visualization generated at act 508 may include: (1) a visualization of a first graph representing the derived data lineage obtained at act 504, the first graph including nodes representing data entities, physical data elements, and/or transformations; (2) a visualization of a second graph representing the user-specified data lineage obtained at act 502, the second graph including nodes representing data containers, business data elements, and/or transformations; and (3) one or more edges between nodes in the graphs representing association links between physical and business data elements and/or between transformations in the two lineages. Other non-limiting example visualizations are illustrated in FIGS. 6A-6E and 8A-8F, herein.

Next, process 500 proceeds to act 509, where a measure of data quality is determined for each of one or multiple business data elements based on a measure of data quality for each of one or more physical data elements associated with the business data element(s). In some embodiments, a measure of quality for a physical data element may be evaluated using one or more predefined data quality rules, which may define criteria for evaluating the values of physical data elements, such as by identifying characteristics (e.g., accuracy, precision, completeness, and validity) of the values according to the criteria. The extent to which the values exhibit these characteristics may thereby produce a measure of data quality for the physical data elements and, by virtue of the association between the physical and business data elements, a measure of data quality for the business data elements.
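One possible way of rolling a data quality measure up from physical data elements to an associated business data element is sketched below. The specific rules (completeness and a range-based validity check), the use of the minimum as the per-element score, and the use of the mean as the business-level score are assumptions chosen for illustration; the description above does not prescribe any particular rule or aggregation.

```python
def completeness(values):
    """Fraction of values that are present (not None)."""
    return sum(v is not None for v in values) / len(values)


def validity(values, is_valid):
    """Fraction of present values that satisfy the rule `is_valid`."""
    present = [v for v in values if v is not None]
    return sum(is_valid(v) for v in present) / len(present) if present else 0.0


def physical_quality(values, is_valid):
    # Assumed scoring: the weakest characteristic bounds the element's score.
    return min(completeness(values), validity(values, is_valid))


def business_quality(business_id, links, scores):
    """Average the scores of physical elements linked to a business element."""
    linked = [scores[p] for p, b in links if b == business_id]
    return sum(linked) / len(linked) if linked else None


def in_score_range(v):
    return 300 <= v <= 850  # hypothetical validity rule for a credit score


scores = {
    "A": physical_quality([710, 640, None, 805], in_score_range),  # one missing
    "C": physical_quality([710, 640, 900, 805], in_score_range),   # one invalid
}
links = [("A", "credit score"), ("C", "credit score")]
credit_score_quality = business_quality("credit score", links, scores)
```

In this sketch each physical data element scores 0.75, and the associated "credit score" business data element inherits the average of those scores.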

Next, process 500 proceeds to decision block 510, where it is determined whether there is a discrepancy among the user-specified data lineage obtained at act 502, the derived data lineage obtained at act 504, and the association obtained at act 506. In some instances, the association between the two types of lineages may be correct and the discrepancy may occur due to a discrepancy between the lineages themselves. In other instances, there may be an error in the association between the two types of lineages and the discrepancy may occur as a result of the error.

The discrepancy may be detected in any suitable way. For example, in some embodiments, the data processing system may check to see whether a physical data element (e.g., physical data element 332 in FIG. 3A), which is used to obtain another physical data element (e.g., physical data element 322 in FIG. 3A), is associated with a business data element (e.g., business data element 308 in FIG. 3A) that is used to obtain a business data element (e.g., business data element 302 in FIG. 3A) that is associated with the other physical data element (e.g., physical data element 322 in FIG. 3A). As another example, the data processing system may determine whether a first set of one or more sources of data identified in the derived data lineage as being used to obtain a physical data element P is different from (or is the same as) a second set of one or more sources of data identified in the user-specified lineage as being used to obtain a business data element B associated with physical data element P.
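The second detection approach, comparing the set of sources indicated by each lineage, may be sketched as follows. The element names, the `source_names` mapping that translates a physical source into the label used in the user-specified lineage, and the function names are hypothetical; the two source labels echo the "credit score" example discussed herein.

```python
def root_sources(element, parents):
    """Return upstream elements that have no parents of their own."""
    roots, stack, seen = set(), [element], {element}
    while stack:
        node = stack.pop()
        ups = parents.get(node, [])
        if not ups and node != element:
            roots.add(node)
        for up in ups:
            if up not in seen:
                seen.add(up)
                stack.append(up)
    return roots


def compare_sources(p, b, derived_parents, stated_parents, source_names):
    """Report sources named by only one of the two lineages."""
    derived = {source_names.get(s, s) for s in root_sources(p, derived_parents)}
    stated = root_sources(b, stated_parents)
    return {"only_derived": derived - stated, "only_stated": stated - derived}


derived_parents = {
    "risk_datamart.credit_score": ["cdw.credit_score"],
    "cdw.credit_score": ["us_origination.score"],
}
stated_parents = {"Credit Score": ["External Data"]}
source_names = {"us_origination.score": "U.S. Origination Systems"}

report = compare_sources(
    "risk_datamart.credit_score", "Credit Score",
    derived_parents, stated_parents, source_names,
)
```

Two empty sets would indicate agreement between the lineages. In this sketch, each lineage names a source that the other does not, which corresponds to a discrepancy.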

When no discrepancy is detected between the user-specified and derived data lineages, process 500 proceeds, via the NO branch, to decision block 514. On the other hand, when there is a discrepancy detected, process 500 proceeds to act 512, where an indication of the discrepancy is provided to a user. The indication may be graphical, textual, or any suitable combination thereof. For example, the indication may be provided as part of a graphical user interface (see e.g., FIG. 6D), a text message, an e-mail, and/or any other suitable form of communication.

At decision block 514, a determination is made as to whether to refresh the derived data lineage obtained at act 504. This determination may be made in any suitable way. For example, in some embodiments, the derived data lineage may be automatically refreshed according to a schedule. In some embodiments, a user may provide input (e.g., in response to a prompt or without being prompted) indicating whether the derived data lineage is to be refreshed. When it is determined that the derived data lineage is to be refreshed, process 500 returns to act 504 via the YES branch. Otherwise, the process 500 completes.

It should be appreciated that process 500 is illustrative and that there are variations of this process. For example, although in the illustrated embodiment, an indication of a discrepancy is provided to a user in response to a discrepancy between the user-specified and derived data lineages being detected, in other embodiments, one or more automated actions may be taken to address the discrepancy. For example, in some embodiments, the derived data lineage may be refreshed in an effort to eliminate the discrepancy. As another example, in some embodiments, the data processing system executing process 500 may change the user-specified data lineage to be consistent with the derived data lineage. As yet another example, the data processing system may use the user-specified data lineage to help it to obtain a new derived data lineage.

As another example of a variation of process 500, it should be appreciated that not all of the acts of process 500 are required in every embodiment. For example, in some embodiments, any one or more of acts 508-514 may be optional. For instance, in some embodiments, process 500 may proceed without performing acts 508 and/or 509.

FIGS. 6A-6E show some additional illustrative examples of graphical user interfaces that may be used in connection with some embodiments of the technology described herein. The graphical user interfaces of FIGS. 6A-6E provide information about the business data element “credit score,” which may represent the credit score of a bank customer.

As described herein, a data processing system may maintain information about which parties are accountable for management of a business data element. As an example of this, the illustrative graphical user interface 600 of FIG. 6A identifies four individuals (including a business owner 602, data steward 604, and two subject matter experts 606 and 608) accountable for management of the “credit score” business data element 601.

FIGS. 6B and 6C provide information about the derived data lineage for the physical data element corresponding to the “credit score” business data element 601. The graphical user interface 610 of FIG. 6B shows a listing 612 of the systems involved in generating the physical data element corresponding to the business data element 601.

FIG. 6C is an illustrative user interface presenting a derived data lineage 630 for the business data element 601 “credit score.” As shown in FIG. 6C, the physical data element corresponding to the business data element 601 is stored in feed 621 within risk datamart 622. The physical data elements in feed 621 are obtained using physical data elements stored in storage 623 of customer data warehouse 624. The physical data elements in storage 623 are obtained using physical data elements in feeds 626, which in turn are obtained from physical data elements stored in systems 628, 630, and 632.

FIG. 6D is an illustrative user interface 640 presenting information in the stated data lineage for the business data element 601 “credit score.” As shown in the interface, the user-specified (“stated”) source of the data used to obtain the physical data element associated with the credit score business data element 601 is “External Data” 642.

FIG. 6E is an illustrative user interface 650 indicating presence of a discrepancy between the user-specified and derived lineages for the business data element 601. As shown in the interface 650, the user-specified (“stated”) source of the data used to obtain the physical data element associated with the credit score business data element 601 is “External Data” 652. However, according to the derived data lineage, the source for this physical data element is “U.S. Origination Systems” 654. As can be seen from FIG. 6E, the user interface 650 presents the discrepancy between the user-specified and derived lineages to the user by showing (through checkmarks in boxes) that the stated and derived sources for the physical data element corresponding to business data element 601 do not match.
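The comparison behind FIG. 6E can be illustrated with a minimal sketch, assuming that the stated and derived sources are each available as a set of source names. The function name `compare_sources` and the dictionary keys are hypothetical and are not taken from the described system.

```python
def compare_sources(stated, derived):
    """Split source names into those that agree and those found on one side only."""
    stated, derived = set(stated), set(derived)
    return {
        "agree": sorted(stated & derived),          # in both lineages
        "stated_only": sorted(stated - derived),    # asserted by a user, not derived
        "derived_only": sorted(derived - stated),   # derived, but never stated
    }


# Source names taken from the FIG. 6E example.
report = compare_sources({"External Data"}, {"U.S. Origination Systems"})

# A discrepancy exists whenever either one-sided list is non-empty.
has_discrepancy = bool(report["stated_only"] or report["derived_only"])
```

In this sketch the two sets share no element, so `has_discrepancy` is true, which corresponds to the mismatch that the interface 650 presents to the user.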

FIG. 8A is a diagram of an illustrative user interface presenting a user-specified data lineage 800 for the business data element “Total Credit Exposure” contained in the report “Consumer Exposure Report,” represented by node 808. The user-specified data lineage 800 indicates, among other things, the following:

-   (1) the inputs to the business data element “Total Credit Exposure” are “Credit Score” and “Outstanding Loan Amount,” both of which are in a database system called “Risk Datamart,” represented by node 806, and are aggregated as inputs to the “Total Credit Exposure” business data element;
-   (2) the “Credit Score” business data element in the Risk Datamart database has a table column input, in the same database, which goes through a transformation to sort the credit scores into bands;
-   (3) the “Credit Score” table column in an application called Customer Data Warehouse (CDW), represented by node 804, is a pass through input to the “Credit Score” table column in “Risk Datamart” and is checked by an automatic control called “Credit Score Check” shown by the checkbox along link 805 between nodes 804 and 806; and
-   (4) the contents of the “Credit Score” table column in the CDW application depend on data coming from each of three different originating systems: Canada Origination Systems represented by node 802 a, Mexico Origination Systems represented by node 802 c, and US Origination Systems represented by node 802 d, as well as a third-party application “Credit Bureau Data,” represented by node 802 b.

As may be appreciated from the foregoing, in user-specified data lineage 800, the various nodes represent different systems, applications, a database, and a report. The links between the nodes represent flows of data, which is why they are sometimes called “flows.” In the user-specified data lineage 800, the links 803 a-d represent respective flows from nodes 802 a-d to node 804, link 805 represents the flow of data from node 804 to node 806, and link 807 represents a flow of data from node 806 to node 808. Note that each of the links in user-specified lineage 800 indicates not only a flow of data between nodes, but also a dependency among the business data elements contained therein. For example, link 803 a indicates that the data in the “Credit Score” table in the CDW application represented by node 804 depends on the “Credit Score” table in Canada Origination Systems represented by node 802 a. A link in the user-specified lineage is indicative of a data dependency. A link from business data element A to business data element B indicates that business data element B depends on business data element A.
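The node-and-link structure of user-specified data lineage 800 can be sketched as a small directed graph. The following is only an illustrative sketch: it assumes that each link is stored as a (source, target) pair keyed by reference numeral, and the helper names `build_upstream_index` and `all_upstream` are hypothetical.

```python
from collections import defaultdict

# Links of user-specified lineage 800, as (source node, target node) pairs.
# A link from A to B means that B depends on A.
LINKS = [
    ("802a", "804"), ("802b", "804"), ("802c", "804"), ("802d", "804"),  # links 803 a-d
    ("804", "806"),                                                      # link 805
    ("806", "808"),                                                      # link 807
]


def build_upstream_index(links):
    """Map each node to the set of nodes it directly depends on."""
    upstream = defaultdict(set)
    for src, dst in links:
        upstream[dst].add(src)
    return upstream


def all_upstream(node, upstream):
    """Collect every node that the given node depends on, directly or indirectly."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in upstream.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


index = build_upstream_index(LINKS)
direct_inputs_of_cdw = index["804"]            # the four originating nodes 802 a-d
report_dependencies = all_upstream("808", index)  # everything upstream of the report
```

Walking the links backwards from node 808 in this way yields the upstream lineage of the report, while the per-node index answers the direct-dependency question posed by a single link.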

As shown in FIG. 8A, some of the links are indicated using thick lines (e.g., links 803 a, 803 c and 803 d), some of the links are indicated using dashed lines (e.g., link 803 b), and some of the links are indicated using thin lines (e.g., link 807). In some embodiments, a thin line link indicates that the dependency represented by the link in a user-specified lineage has no corresponding dependency (e.g., represented by one or more links) in a derived data lineage. For example, a link between two business data elements may be shown by a thin line when there is no dependency, in the derived data lineage, between two physical data elements corresponding to the two business data elements. As one illustrative example, in FIG. 8F, the dependency GUI element 840 for link 807 shows with a checkmark near the “Stated” field 842 that the dependency represented by link 807 was specified by a user, but the lack of a checkmark near the “Derived” field 842 indicates that there is no corresponding dependency in the derived data lineage associated with the user-specified data lineage. In this way, a thin line link in a user-specified lineage may indicate the presence of a disparity between the user-specified lineage and the associated derived data lineage. Such a disparity may be detected using an association between the user-specified data lineage and a derived data lineage in accordance with the techniques described herein.

In some embodiments, a thick line link indicates that the dependency represented by the link in a user-specified lineage has a corresponding dependency (e.g., represented by one or more links) in a derived data lineage. For example, a link between two business data elements may be shown by a thick line when there is a corresponding dependency, in the derived data lineage, between two physical data elements corresponding to the two business data elements. As one illustrative example, in FIG. 8B, the dependency GUI element 810 for link 803 a shows: (1) with a checkmark near the “Stated” field 812 that the dependency represented by link 803 a was specified by a user; and (2) with a checkmark near the “Derived” field 814 that there is a corresponding dependency in the derived data lineage associated with the user-specified data lineage 800. Clicking on GUI element 816 reveals this corresponding dependency between nodes 822 and 824 (through node 824) in the derived data lineage 820 shown in FIG. 8C. In this way, a thick line link in a user-specified lineage may indicate agreement or correspondence between the user-specified lineage and the derived data lineage. Such an agreement or correspondence may be detected using an association between the user-specified data lineage and a derived data lineage in accordance with the techniques described herein.

In some embodiments, a dashed-line link (e.g., link 803 b in FIG. 8A) may indicate that the dependency is on data provided by a third-party application.
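The thick, thin, and dashed link conventions can be summarized in a short sketch. It assumes that the association is a mapping from business data element names to physical data element names, that the derived lineage is a list of (source, target) pairs, and that a dashed line takes precedence over the thick/thin distinction; that precedence, along with every name below, is an assumption made for illustration only.

```python
def reaches(derived_links, start, goal):
    """True if goal can be reached from start by following one or more derived links."""
    downstream = {}
    for src, dst in derived_links:
        downstream.setdefault(src, set()).add(dst)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in downstream.get(node, ()):
            if nxt == goal:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def link_style(stated_link, association, derived_links, third_party=frozenset()):
    """Choose a line style for one user-specified link."""
    src, dst = stated_link
    if src in third_party:
        return "dashed"  # assumed precedence: third-party data is drawn dashed
    phys_src, phys_dst = association.get(src), association.get(dst)
    if phys_src is None or phys_dst is None:
        return "thin"    # no associated physical element, so nothing corresponds
    return "thick" if reaches(derived_links, phys_src, phys_dst) else "thin"


# Hypothetical element names, loosely modelled on FIG. 8A.
association = {
    "Canada Credit Score": "ca.credit_score",
    "CDW Credit Score": "cdw.credit_score",
    "Risk Datamart Credit Score": "rdm.credit_score",
    "Total Credit Exposure": "rpt.total_exposure",
}
derived_links = [
    ("ca.credit_score", "stage.credit_score"),
    ("stage.credit_score", "cdw.credit_score"),
]

style_803a = link_style(("Canada Credit Score", "CDW Credit Score"),
                        association, derived_links)
style_807 = link_style(("Risk Datamart Credit Score", "Total Credit Exposure"),
                       association, derived_links)
style_803b = link_style(("Credit Bureau Data", "CDW Credit Score"),
                        association, derived_links,
                        third_party={"Credit Bureau Data"})
```

Reachability rather than a direct-edge test is used because a single stated dependency may correspond to one or more links in the derived data lineage.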

In some embodiments, a graphical user interface showing a user-specified data lineage may also show one or more control check GUI elements, which may provide credibility for assertions made by a user creating the user-specified lineage. For example, as shown in FIG. 8A, control check GUI elements are represented as circled letters, where the circled V indicates that the node passed the validity control check, and the circled A indicates that the node passed the accuracy control check. Additionally or alternatively, a graphical indication that a data quality control check was passed may be provided. Control check GUI elements may apply to both nodes and links/flows. For example, a check box on the link 805 indicates that a check on one or more credit scores was performed.

FIG. 8D is a diagram of an illustrative user interface presenting information about a node in the user-specified data lineage of FIG. 8A, in accordance with some embodiments of the technology described herein. As shown in FIG. 8D, panel 825 is showing additional information associated with business data element “Credit Score” and includes a link to the corresponding physical data element “credit_score,” the link being indicated by reference numeral 830. This provides another view of the association between the user-specified and derived data lineages. Clicking on the link for physical data element “credit_score” indicated by the reference numeral 830 provides further information about the physical data element, for example, as shown in panel 835 of FIG. 8E. Further clicking on the GUI element 836 shown in FIG. 8E will show at least a portion of a derived data lineage containing the physical data element “credit_score”.
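The navigation from a business data element to its physical data element relies on the stored association between the two. One way such an association might be held is sketched below; the `Association` class and its method names are hypothetical, and the sketch assumes that a business data element may be associated with more than one physical data element.

```python
from dataclasses import dataclass, field


@dataclass
class Association:
    """Hypothetical store linking business data elements to physical data elements."""
    business_to_physical: dict = field(default_factory=dict)

    def link(self, business, physical):
        # Record that a business data element is backed by a physical data element.
        self.business_to_physical.setdefault(business, set()).add(physical)

    def physical_for(self, business):
        # Physical data elements associated with one business data element.
        return sorted(self.business_to_physical.get(business, ()))

    def business_for(self, physical):
        # Reverse lookup: business data elements backed by one physical data element.
        return sorted(
            b for b, p in self.business_to_physical.items() if physical in p
        )


assoc = Association()
assoc.link("Credit Score", "credit_score")

linked_physical = assoc.physical_for("Credit Score")
linked_business = assoc.business_for("credit_score")
```

A lookup in either direction supports both views described above: moving from panel 825 to the physical data element, and moving from a physical data element back to the business data elements it supports.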

FIG. 7 illustrates an example of a suitable computing system environment 700 on which the technology described herein may be implemented. The computing system environment 700 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the technology described herein. Neither should the computing environment 700 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 700.

The technology described herein is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the technology described herein include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

The computing environment may execute computer-executable instructions, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The technology described herein may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

With reference to FIG. 7, an exemplary system for implementing the technology described herein includes a general purpose computing device in the form of a computer 710. Components of computer 710 may include, but are not limited to, a processing unit 720, a system memory 730, and a system bus 721 that couples various system components including the system memory to the processing unit 720. The system bus 721 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.

Computer 710 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 710 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 710. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.

The system memory 730 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 731 and random access memory (RAM) 732. A basic input/output system 733 (BIOS), containing the basic routines that help to transfer information between elements within computer 710, such as during start-up, is typically stored in ROM 731. RAM 732 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 720. By way of example, and not limitation, FIG. 7 illustrates operating system 734, application programs 735, other program modules 736, and program data 737.

The computer 710 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 7 illustrates a hard disk drive 741 that reads from or writes to non-removable, nonvolatile magnetic media, a flash drive 751 that reads from or writes to a removable, nonvolatile memory 752 such as flash memory, and an optical disk drive 755 that reads from or writes to a removable, nonvolatile optical disk 756 such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 741 is typically connected to the system bus 721 through a non-removable memory interface such as interface 740, and flash drive 751 and optical disk drive 755 are typically connected to the system bus 721 by a removable memory interface, such as interface 750.

The drives and their associated computer storage media discussed above and illustrated in FIG. 7 provide storage of computer readable instructions, data structures, program modules and other data for the computer 710. In FIG. 7, for example, hard disk drive 741 is illustrated as storing operating system 744, application programs 745, other program modules 746, and program data 747. Note that these components can either be the same as or different from operating system 734, application programs 735, other program modules 736, and program data 737. Operating system 744, application programs 745, other program modules 746, and program data 747 are given different numbers here to illustrate that, at a minimum, they are different copies. A user may enter commands and information into the computer 710 through input devices such as a keyboard 762 and pointing device 761, commonly referred to as a mouse, trackball or touch pad. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 720 through a user input interface 760 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 791 or other type of display device is also connected to the system bus 721 via an interface, such as a video interface 790. In addition to the monitor, computers may also include other peripheral output devices such as speakers 797 and printer 796, which may be connected through an output peripheral interface 795.

The computer 710 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 780. The remote computer 780 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 710, although only a memory storage device 781 has been illustrated in FIG. 7. The logical connections depicted in FIG. 7 include a local area network (LAN) 771 and a wide area network (WAN) 773, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.

When used in a LAN networking environment, the computer 710 is connected to the LAN 771 through a network interface or adapter 770. When used in a WAN networking environment, the computer 710 typically includes a modem 772 or other means for establishing communications over the WAN 773, such as the Internet. The modem 772, which may be internal or external, may be connected to the system bus 721 via the user input interface 760, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 710, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 7 illustrates remote application programs 785 as residing on memory device 781. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

Having thus described several aspects of at least one embodiment of this invention, it is to be appreciated that various alterations, modifications, and improvements will readily occur to those skilled in the art.

Such alterations, modifications, and improvements are intended to be part of this disclosure, and are intended to be within the spirit and scope of the invention. Further, though advantages of the present invention are indicated, it should be appreciated that not every embodiment of the technology described herein will include every described advantage. Some embodiments may not implement any features described as advantageous herein and in some instances one or more of the described features may be implemented to achieve further embodiments. Accordingly, the foregoing description and drawings are by way of example only.

The above-described embodiments of the technology described herein can be implemented in any of numerous ways. For example, the embodiments may be implemented using hardware, software or a combination thereof. When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers. Such processors may be implemented as integrated circuits, with one or more processors in an integrated circuit component, including commercially available integrated circuit components known in the art by names such as CPU chips, GPU chips, microprocessor, microcontroller, or co-processor. Alternatively, a processor may be implemented in custom circuitry, such as an ASIC, or semicustom circuitry resulting from configuring a programmable logic device. As yet a further alternative, a processor may be a portion of a larger circuit or semiconductor device, whether commercially available, semi-custom or custom. As a specific example, some commercially available microprocessors have multiple cores such that one or a subset of those cores may constitute a processor. However, a processor may be implemented using circuitry in any suitable format.

Further, it should be appreciated that a computer may be embodied in any of a number of forms, such as a rack-mounted computer, a desktop computer, a laptop computer, or a tablet computer. Additionally, a computer may be embedded in a device not generally regarded as a computer but with suitable processing capabilities, including a Personal Digital Assistant (PDA), a smart phone or any other suitable portable or fixed electronic device.

Also, a computer may have one or more input and output devices. These devices can be used, among other things, to present a user interface. Examples of output devices that can be used to provide a user interface include printers or display screens for visual presentation of output and speakers or other sound generating devices for audible presentation of output. Examples of input devices that can be used for a user interface include keyboards, and pointing devices, such as mice, touch pads, and digitizing tablets. As another example, a computer may receive input information through speech recognition or in other audible format.

Such computers may be interconnected by one or more networks in any suitable form, including as a local area network or a wide area network, such as an enterprise network or the Internet. Such networks may be based on any suitable technology and may operate according to any suitable protocol and may include wireless networks, wired networks or fiber optic networks.

Also, the various methods or processes outlined herein may be coded as software that is executable on one or more processors that employ any one of a variety of operating systems or platforms. Additionally, such software may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine.

In this respect, the invention may be embodied as a computer readable storage medium (or multiple computer readable media) (e.g., a computer memory, one or more floppy discs, compact discs (CD), optical discs, digital video disks (DVD), magnetic tapes, flash memories, circuit configurations in Field Programmable Gate Arrays or other semiconductor devices, or other tangible computer storage medium) encoded with one or more programs that, when executed on one or more computers or other processors, perform methods that implement the various embodiments of the invention discussed above. As is apparent from the foregoing examples, a computer readable storage medium may retain information for a sufficient time to provide computer-executable instructions in a non-transitory form. Such a computer readable storage medium or media can be transportable, such that the program or programs stored thereon can be loaded onto one or more different computers or other processors to implement various aspects of the present invention as discussed above. As used herein, the term “computer-readable storage medium” encompasses only a non-transitory computer-readable medium that can be considered to be a manufacture (i.e., article of manufacture) or a machine. Alternatively or additionally, the invention may be embodied as a computer readable medium other than a computer-readable storage medium, such as a propagating signal.

The terms “program” or “software” are used herein in a generic sense to refer to any type of computer code or set of computer-executable instructions that can be employed to program a computer or other processor to implement various aspects of the present invention as discussed above. Additionally, it should be appreciated that according to one aspect of this embodiment, one or more computer programs that when executed perform methods of the present invention need not reside on a single computer or processor, but may be distributed in a modular fashion amongst a number of different computers or processors to implement various aspects of the present invention.

Computer-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Typically the functionality of the program modules may be combined or distributed as desired in various embodiments.

Also, data structures may be stored in computer-readable media in any suitable form. For simplicity of illustration, data structures may be shown to have fields that are related through location in the data structure. Such relationships may likewise be achieved by assigning storage for the fields with locations in a computer-readable medium that conveys relationship between the fields. However, any suitable mechanism may be used to establish a relationship between information in fields of a data structure, including through the use of pointers, tags or other mechanisms that establish relationship between data elements.

Various aspects of the present invention may be used alone, in combination, or in a variety of arrangements not specifically discussed in the embodiments described in the foregoing and are therefore not limited in their application to the details and arrangement of components set forth in the foregoing description or illustrated in the drawings. For example, aspects described in one embodiment may be combined in any manner with aspects described in other embodiments.

Also, the invention may be embodied as a method, of which an example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.

Further, some actions are described as taken by a “user.” It should be appreciated that a “user” need not be a single individual, and that in some embodiments, actions attributable to a “user” may be performed by a team of individuals and/or an individual in combination with computer-assisted tools or other mechanisms.

Use of ordinal terms such as “first,” “second,” “third,” etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed, but are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term) to distinguish the claim elements.

Also, the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” or “having,” “containing,” “involving,” and variations thereof herein, is meant to encompass the items listed thereafter and equivalents thereof as well as additional items.

What is claimed is:
1. At least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining a first data lineage representing relationships among a plurality of physical data elements; obtaining, based at least in part on user input, a second data lineage representing relationships among a plurality of business data elements; obtaining an association between at least some of the plurality of physical data elements of the first data lineage and at least some of the plurality of business data elements of the second data lineage; and generating, based on the association between the plurality of physical data elements and the plurality of business data elements, an indication of agreement or discrepancy between the first data lineage and the second data lineage.
2. The at least one non-transitory computer-readable storage medium of claim 1, wherein generating the indication of agreement or discrepancy comprises: displaying a visualization of the second data lineage showing the indication of agreement or discrepancy.
3. The at least one non-transitory computer-readable storage medium of claim 2, wherein the second data lineage comprises a first link representing a first dependency between two business data elements, and wherein displaying the visualization of the second data lineage comprises displaying the link in one manner when there is a dependency in the first data lineage corresponding to the first dependency and in another manner when there is not a dependency in the first data lineage corresponding to the first dependency.
4. The at least one non-transitory computer-readable storage medium of claim 1, wherein obtaining the first data lineage comprises analyzing source code of at least one computer program configured to access at least some of the plurality of physical data elements.
5. The at least one non-transitory computer-readable storage medium of claim 1, wherein obtaining the first data lineage comprises analyzing information obtained during runtime of at least one computer program configured to access at least some of the plurality of physical data elements.
6. The at least one non-transitory computer-readable storage medium of claim 1, wherein generating the indication of agreement or discrepancy comprises: determining, based on the association between the plurality of physical data elements and the plurality of business data elements, whether there is one or more discrepancies among the first data lineage, the second data lineage, and the obtained association.
7. The at least one non-transitory computer-readable storage medium of claim 6, wherein the plurality of physical data elements comprises a first physical data element, wherein the plurality of business data elements comprises a first business data element, wherein the association indicates that the first physical data element and the first business data element are associated, and wherein the determining comprises determining that a first set of one or more sources of data identified in the first data lineage as being used to obtain the first physical data element is different from a second set of one or more sources of data identified in the second data lineage as being used to obtain the first business data element.
8. The at least one non-transitory computer-readable storage medium of claim 6, wherein acts of obtaining the first data lineage and determining whether there is a discrepancy are performed repeatedly according to a specified schedule.
9. The at least one non-transitory computer-readable storage medium of claim 1, wherein obtaining the first data lineage comprises generating the first data lineage at least in part by performing at least one of analyzing source code of at least one computer program configured to access at least some of the plurality of physical data elements and analyzing information obtained during runtime of the at least one computer program.
10. The at least one non-transitory computer-readable storage medium of claim 9, wherein the at least one computer program comprises a computer program implemented as a dataflow graph.
11. The at least one non-transitory computer-readable storage medium of claim 1, wherein obtaining the association between the at least some of the plurality of physical data elements of the first data lineage and the at least some of the plurality of business data elements of the second data lineage comprises generating the association based on user input provided via a graphical user interface.
12. The at least one non-transitory computer-readable storage medium of claim 1, wherein the association comprises an association between a first physical data element of the plurality of physical data elements and a first business data element of the plurality of business data elements, and wherein the at least one computer hardware processor is further configured to perform: determining, based at least in part on the association between the first physical data element and the first business data element, a measure of data quality for the first business data element.
13. The at least one non-transitory computer-readable storage medium of claim 12, wherein determining the measure of data quality for the first business data element comprises: performing an analysis of data quality of data in the first physical data element based at least in part on one or more data quality rules associated with the data in the first physical data element.
 14. The at least one non-transitory computer-readable storage medium of claim 12, wherein the measure of data quality for the first business element includes a measure of one or more of accuracy, completeness, and validity.
 15. At least one non-transitory computer-readable storage medium storing processor executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining a first data lineage representing relationships among a plurality of physical data elements; obtaining, based at least in part on user input, a second data lineage representing relationships among a plurality of business data elements; obtaining an association between at least some of the plurality of physical data elements of the first data lineage and at least some of the plurality of business data elements of the second data lineage, the association including an association between a first physical data element of the plurality of physical data elements and a first business data element of the plurality of business data elements; and determining a measure of data quality for the first business data element based at least in part on at least one data quality measure associated with the first physical data element and the association between the first physical data element and the first business data element.
 16. The at least one non-transitory computer-readable storage medium of claim 15, wherein determining the measure of data quality for the first business data element comprises: performing an analysis of data quality of data in the first physical data element based at least in part on one or more data quality rules associated with the data in the first physical data element to obtain the at least one data quality measure associated with the first physical data element.
 17. The at least one non-transitory computer-readable storage medium of claim 15, wherein the measure of data quality for the first business element includes a measure of one or more of accuracy, completeness, and validity.
 18. The at least one non-transitory computer-readable storage medium of claim 15, wherein obtaining the first data lineage comprises receiving the first data lineage after it has been generated.
 19. The at least one non-transitory computer-readable storage medium of claim 15, wherein obtaining the first data lineage comprises generating the first data lineage.
 20. The at least one non-transitory computer-readable storage medium of claim 19, wherein generating the first data lineage comprises analyzing source code of at least one computer program configured to access at least some of the plurality of physical data elements.
 21-29. (canceled)